Posts Tagged ‘solr’

Comparing Solr Response Sizes

Posted in Solr, SolrNet on March 9th, 2012 by gregsochanik

After seeing some relative success in improving our Solr implementation’s XML response times by switching on Tomcat’s HTTP gzip compression, I’ve been comparing the other formats Solr can return.

We use SolrNet, an excellent open source .NET Solr client. At the moment it only supports XML responses, but every request sends the “Accept-Encoding: gzip” header as standard, so all you have to do is enable gzip on your server and you get nicely compressed responses. There is talk of supporting javabin de-serialisation, but it’s not there yet.

I decided to use curl to compare responses of 1000 rows and 10000 rows in json, javabin, json with gzip compression and javabin with gzip compression. My test setup is a Solr 1.4 instance holding around 11000 records, sitting behind an nginx reverse proxy that handles the gzip compression. As mentioned above, the same could easily be achieved by switching on gzip compression in Apache Tomcat.
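A comparison like this can also be scripted. The following is a minimal sketch in Python, not the curl commands used for the test; the endpoint address and the helper names are hypothetical placeholders. It builds a match-all request for each format, optionally adds the Accept-Encoding: gzip header, and counts the bytes that come over the wire.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the address of your own Solr or proxy.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_url(wt, rows, base=SOLR_SELECT):
    """Build a match-all select URL for a response writer and row count."""
    params = {"q": "*:*", "wt": wt, "rows": rows}
    return base + "?" + urllib.parse.urlencode(params)


def build_request(wt, rows, gzip_enabled, base=SOLR_SELECT):
    """Create the request, optionally asking for a gzip-compressed body."""
    headers = {"Accept-Encoding": "gzip"} if gzip_enabled else {}
    return urllib.request.Request(build_url(wt, rows, base), headers=headers)


def wire_size(request):
    """Return the body size in bytes as received.

    urllib does not decompress gzip bodies itself, so this is the
    compressed size whenever the server honoured Accept-Encoding.
    """
    with urllib.request.urlopen(request) as response:
        return len(response.read())


def compare(formats=("xml", "json", "javabin"), row_counts=(1000, 10000)):
    """Yield (format, rows, gzip_enabled, bytes) for every combination."""
    for wt in formats:
        for rows in row_counts:
            for gzip_enabled in (False, True):
                request = build_request(wt, rows, gzip_enabled)
                yield wt, rows, gzip_enabled, wire_size(request)
```

Looping over compare() against a live server and printing each tuple gives a table of sizes to compare. Nothing here touches the network until compare() or wire_size() is called.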

This picture shows the results.

As you can see, for the same 10000 records returned by the q=*:* query, wt=json with HTTP gzip compression gives the smallest response, but only marginally so compared to wt=javabin. It would seem that json compresses very well indeed. You can also see the massive drop that simply switching on gzip compression gives to xml.
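One likely reason json compresses so well is that every document repeats the same field names, which is exactly the kind of redundancy gzip removes. The sketch below illustrates this with synthetic data only; the field names and values are invented for the example and it does not reproduce the measurements above.

```python
import gzip
import json


def make_docs(count):
    """Build synthetic documents; the field names are made up for illustration."""
    return [
        {
            "id": i,
            "artist": f"Artist {i % 50}",
            "title": f"Release {i}",
            "price": 7.99,
        }
        for i in range(count)
    ]


def payload_sizes(docs):
    """Return (raw_bytes, gzipped_bytes) for a Solr-like json response body."""
    body = {"response": {"numFound": len(docs), "docs": docs}}
    raw = json.dumps(body).encode("utf-8")
    return len(raw), len(gzip.compress(raw))
```

Calling payload_sizes(make_docs(10000)) returns the uncompressed and compressed sizes side by side. Real catalogue data is less repetitive than this, so the ratio will differ.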

My conclusion is that because json is a widely accepted content type with many well-known, fast de-serialising libraries, it would probably be worth implementing json support rather than trying to de-serialise javabin. But this was only a quick test, and it doesn’t take into account how quickly Solr serialises the documents server-side.

Search

Posted in API, Search, Solr, SolrNet on October 13th, 2011 by Mark Unsworth

We will be the first to admit that our search has been far from optimal for some time; it’s something that has frustrated us as much as it has our users. Unfortunately the unprecedented growth of 7digital has taken its toll on the original search infrastructure that has powered our platform for the last 7 years (that’s right, we were 7 this year).

A few weeks back we quietly made some changes to the artist and release search. These changes have been in the works for several months and have improved the quality of our search results as well as the speed at which those results are returned. Alongside the improvements to quality and speed, we are now also returning accurate pricing for releases across all of our catalogue, something that we haven’t been able to do previously.

Architecture

The main reason for the improvements has been our move away from SQL Server Full Text search to the open source Solr search platform. Solr is a super fast open source search server built on top of the Lucene search library. We’re also using SolrNet to index and query Solr from our .NET codebase (more on our SolrNet usage here).

We have a master-slave setup in which we index all of our documents (~40m) to a single write-only master, which is then set to replicate out to a number of read-only slaves. We aren’t currently sharding the data across the slaves, so they are exact mirrors, with HAProxy in front of them to balance the load.

Load Balancing

We originally went with a round-robin approach to load distribution, but realised that we were potentially caching the same query on each of the slaves, so we switched to the balance url_param feature of HAProxy. This means that the same query is always sent to the same slave. Average query times were reduced by 50% from this change alone. The graph below shows the average response time dropping off and stabilising once the change had been made.
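The effect on caching can be illustrated with a small simulation. This is a sketch of the idea only, not HAProxy’s actual hashing algorithm or our configuration; the server names, the “q” parameter name and the CRC32 hash are all assumptions made for the example.

```python
import itertools
import zlib
from urllib.parse import parse_qs, urlparse

# Hypothetical slave names, for illustration only.
SLAVES = ["solr-slave-1", "solr-slave-2", "solr-slave-3"]


def round_robin(servers):
    """Return a picker that cycles through the servers regardless of the URL."""
    cycle = itertools.cycle(servers)
    return lambda url: next(cycle)


def pick_by_param(url, servers, param="q"):
    """Pick a server from a hash of one query-string parameter.

    Returns None when the parameter is missing, so the caller can fall
    back to another strategy such as round robin.
    """
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    digest = zlib.crc32(values[0].encode("utf-8"))
    return servers[digest % len(servers)]


def servers_used(urls, pick):
    """Map each distinct URL to the set of servers that ended up serving it."""
    seen = {}
    for url in urls:
        seen.setdefault(url, set()).add(pick(url))
    return seen
```

Feeding the same search URL through round_robin(SLAVES) several times spreads it over every slave, so each one has to cache it separately. With pick_by_param it always lands on a single slave, and that slave’s cache serves every repeat.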

Improving Quality

We haven’t had to do a great deal to improve the relevance of the results returned, as Solr gives you this for free, but we have been investing time in looking at the ways our users are searching and seeing what we’re missing from our index. Better logging of search requests should help us understand more about where customers are not finding what they are looking for. We’ll blog more about this when we start work on it.

Speed Improvements

Our average search response times are now less than 200ms. This is a significant improvement on the days of SQL Server Full Text search, when the average query time via our API was around 2 seconds, and also on the initial implementation of search on top of Solr, which had average query times of around 500ms.

The image below, taken from our New Relic dashboard for the Search API, shows the last month’s stats. The left-hand chart shows the average response time (lower is better), the top right shows the Apdex (performance) score (higher is better) and the bottom right shows the number of requests per minute we are seeing.

API Search Traffic 4/9 - 4/10

To put this into perspective, if you search for ‘Lady Antebellum’ on Google it takes around 200 milliseconds, but through our API it only takes 58ms. OK, so Google do return a result set of 54 million pages, but they don’t show our artist page at the top!

Future Plans

We will be making more improvements to search over the coming weeks and months, including a long-awaited update to the track-based search.