Solr – incremental commit strategies and configuration
Posted by Mark Unsworth on July 7th, 2010
We have recently been working on an incremental indexer for our Solr-based search implementation. The index was only being updated sporadically because a complete re-index took about 5 days: creating the 13GB of XML, zipping it, uploading it to the server, unzipping it and then re-indexing.
We have created a Windows service which queries a denormalised data structure using NHibernate. We then use SolrNet to create our Solr documents and push them to the server in batches.
Solr Update Process
When updating a Solr index you post XML documents to your index using the /update URL. However, these documents are not immediately searchable; they need to be made available with the commit command. This flushes the documents to disk, then starts up and registers a new searcher, which is able to see the changes.
Each commit adds the new documents to the index as a new segment, so over time the performance of the index will degrade as Lucene accumulates segments. The index should therefore be optimised with the optimize command after a large amount of data has been added or changed.
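As a rough illustration of the update process described above, the sketch below builds the three XML messages involved (add, commit, optimize) and shows how they could be posted. It is written in Python using only the standard library rather than SolrNet; the field names and the helper names are made up for the example, and the update URL is whatever your own Solr instance exposes.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_update(update_url, message):
    """POST one XML message to the /update URL and return Solr's response body."""
    request = urllib.request.Request(
        update_url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical document; real field names come from your own schema.xml.
add_xml = build_add_message([{"id": "1", "title": "Example"}])

# Posted after the adds, in this order.
commit_xml = "<commit/>"      # makes the added documents searchable
optimize_xml = "<optimize/>"  # merges the index segments
```

Nothing is sent until post_update is called with the address of a running Solr instance, so the message builders can be tried out on their own.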
Commit Strategies
We did some tweaking during development to find the sweet spot for pushing out updates. In our test environments the optimum was to commit in batches of 20k documents and to call the optimize command after every 5 batches.
We made the mistake of not initially testing against the full 13GB index in our test environments, so when the service was deployed to live we found that commits were taking around 4 minutes. This caused the service to time out and mark the updates as failed in our database. Even though the call from SolrNet timed out, the Solr instance still went on to process the commit successfully, so we set the optional waitFlush and waitSearcher attributes to false so that our service wouldn’t have to wait for a response from Solr.
This caused its own problems, as we were then firing off numerous commit commands, which blocked any further adds from being processed. So we needed to reduce the time taken to perform a commit.
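The batching scheme above can be sketched as a small planner that turns a stream of documents into an ordered list of add, commit and optimize steps. This is only an illustration in Python, not our service code; the function names are hypothetical, and the two constants are simply the figures quoted above.

```python
from itertools import islice

BATCH_SIZE = 20_000   # documents per commit (figure from our test environments)
OPTIMIZE_EVERY = 5    # number of batches between optimize calls


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def plan_updates(docs, batch_size=BATCH_SIZE, optimize_every=OPTIMIZE_EVERY):
    """Yield ("add", batch), ("commit", None) and ("optimize", None) steps in order."""
    for number, batch in enumerate(batches(docs, batch_size), start=1):
        yield ("add", batch)
        yield ("commit", None)
        if number % optimize_every == 0:
            yield ("optimize", None)


# Small numbers so the shape of the plan is easy to see.
steps = [name for name, _ in plan_updates(range(7), batch_size=2, optimize_every=2)]
```

Keeping the plan separate from the code that talks to Solr makes it easy to try different batch sizes and optimize intervals against a full-size index before going live.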
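For reference, a commit message that does not wait for Solr might be built as in the sketch below. It is written in Python for illustration; build_commit_message is a hypothetical helper, not part of SolrNet, and waitFlush and waitSearcher are the commit attributes mentioned above.

```python
import xml.etree.ElementTree as ET


def build_commit_message(wait_flush=True, wait_searcher=True):
    """Build a <commit> message; pass False to return without waiting."""
    commit = ET.Element("commit")
    commit.set("waitFlush", str(wait_flush).lower())
    commit.set("waitSearcher", str(wait_searcher).lower())
    return ET.tostring(commit, encoding="unicode")


# The non-blocking variant: the caller gets a response straight away,
# while Solr carries on with the commit in the background.
non_blocking_commit = build_commit_message(wait_flush=False, wait_searcher=False)
```

Because the caller no longer waits, nothing stops it from sending the next commit while the previous one is still running, which is how the commits ended up stacking in our case.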
Server Configuration
While “Googling” the answers to our problems I came across Jay Hill’s post which details some of the most common pitfalls around using Solr. Some of these stem from using the “out-of-the-box” configuration which we were certainly guilty of doing.
We started by reducing the autowarmCount on our caches to zero. This setting specifies how many objects from the currently running searcher’s cache should be copied to the new searcher.
<filterCache class="solr.LRUCache" size="256" initialSize="128" autowarmCount="0"/>
We took this decision because we maintain our own cache of search results using memcached. This reduced our commit time, but optimising was still causing us problems, running for somewhere around 14 minutes. So we decided to forgo some query performance and increase the maximum number of segments that our index could have, using the optional maxSegments attribute when calling optimize.
<optimize maxSegments="5" />
We also removed the DisMax and partitioned search request handlers, as they were never used, and removed the listeners from the newSearcher and firstSearcher events, as we no longer needed to warm the caches of these searchers.
The next step is to upgrade to Solr 1.4 (from 1.3) in a master/slave configuration, so that we have a dedicated write instance and can scale horizontally as we move more towards using Solr rather than our RDBMS.
Did you need to optimize after every five batches to ensure the segments got cleaned up and didn’t proliferate and run out of space? I would have thought that doing an optimize at the very end, even though it may be slower, would be faster than doing multiple optimizes?
Also, I know the HathiTrust did some work around optimization, and found that optimizing first to 16 max segments, then 8, and then 4 segments worked better than trying to do it all at once!
It was purely because the adds/commits were slowing down over time, which we found improved when we optimised. We should revisit this and do some more benchmarking with a full index, as I think it might not be necessary to optimise so often now that our commit times are almost nothing.
Saying that, the problem of really large update batches really only comes about when we’re initialising the index from scratch, or updating the entire index, which shouldn’t really be required again now that we have our incremental indexer running.
In Solr 1.3, how do I update the index incrementally, rather than updating the whole index at once? After googling, I found nothing.
I guess it depends on your own specific setup but we have implemented a Windows service that polls a queue for jobs – our queue is implemented as a simple db table at the moment.
The jobs describe what needs to be updated/deleted from our index so that the service can also retrieve the required data for the document. The service then uses the Add(IEnumerable docs) method to send batches to the Solr instance.
We populate jobs in the queue when the data changes in our database – so for example, if a new product is added to 7digital.com we just add an insert job for that product.
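To make the queue idea more concrete, here is a minimal sketch of a job table and the polling side. It uses Python and an in-memory SQLite database as stand-ins for the Windows service and the real database; the table name, column names and function names are all invented for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS index_jobs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    action     TEXT    NOT NULL,           -- 'insert', 'update' or 'delete'
    product_id INTEGER NOT NULL,
    done       INTEGER NOT NULL DEFAULT 0
)
"""


def enqueue(conn, action, product_id):
    """Record that a product needs adding to, updating in or deleting from the index."""
    conn.execute(
        "INSERT INTO index_jobs (action, product_id) VALUES (?, ?)",
        (action, product_id),
    )
    conn.commit()


def pending_jobs(conn, limit):
    """Return up to `limit` unprocessed jobs, oldest first, as (id, action, product_id)."""
    return conn.execute(
        "SELECT id, action, product_id FROM index_jobs"
        " WHERE done = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def mark_done(conn, job_ids):
    """Flag jobs as processed once the batch has been sent to Solr."""
    conn.executemany(
        "UPDATE index_jobs SET done = 1 WHERE id = ?",
        [(job_id,) for job_id in job_ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
enqueue(conn, "insert", 42)   # e.g. a new product was added
enqueue(conn, "delete", 7)
jobs = pending_jobs(conn, limit=100)
# ... fetch the data for each job and send the batch to Solr here ...
mark_done(conn, [job_id for job_id, _, _ in jobs])
```

Jobs are only marked as done after the batch has been sent, so a failed or timed-out batch stays in the queue and is picked up again on the next poll.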
The service can currently process about 40,000 document changes per minute, although we’re experimenting with multithreading/parallelism to speed this up – if we need to.