Solr – incremental commit strategies and configuration
Posted by Mark Unsworth on July 7th, 2010
We have recently been working on an incremental indexer for our Solr-based search implementation, which was previously updated only sporadically because of the time a complete re-index took: around five days to generate the 13GB of XML, zip it, upload it to the server, unzip it and re-index.
We have created a Windows service which queries a denormalised data structure using NHibernate. We then use SolrNet to create our Solr documents and push them to the server in batches.
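Our service is written in C# with SolrNet, but the shape of the push step can be sketched in Python. The helper below builds a Solr `<add>` command from plain dictionaries and POSTs it to the update URL; the field names and the `http://localhost:8983/solr/update` endpoint are illustrative assumptions, not our production values.

```python
import urllib.request
from xml.sax.saxutils import escape

def to_add_xml(docs):
    """Build a Solr <add> update command from a list of field dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # escape() handles &, < and > in field values
            parts.append('<field name="%s">%s</field>' % (name, escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_update(xml, url="http://localhost:8983/solr/update"):
    """POST any update command (add, commit, optimize) to Solr."""
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    return urllib.request.urlopen(req).read()
```

In the real service, SolrNet handles both the document mapping and the HTTP plumbing for us.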
Solr Update Process
When updating a Solr index you post XML documents to the /update URL. However, these documents are not immediately searchable; they are made available by issuing the commit command, which flushes the documents to disk and then starts up and registers a new searcher that can see the changes.
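In raw terms, the two messages posted to /update look like this (the field names here are illustrative):

```xml
<!-- add one or more documents to the index -->
<add>
  <doc>
    <field name="id">1234</field>
    <field name="title">Example document</field>
  </doc>
</add>

<!-- make all pending documents searchable -->
<commit/>
```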
New documents are written to fresh segments, and over time the performance of the index degrades as Lucene accumulates these segments. After a large amount of data has been added or changed, the index should be optimised using the optimize command.
We did some tweaking during development to find the sweet spot for pushing out updates. In our test environments the optimum was to commit in batches of 20k documents and to call the optimize command after every five batches.
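The cadence we settled on can be expressed as a small planning function: one add-plus-commit per 20k-document batch, with an optimize after every fifth batch. This is a sketch of the scheduling logic only (the function name and shape are ours for illustration); the actual adds go through SolrNet.

```python
def plan_updates(total_docs, batch_size=20000, optimize_every=5):
    """Return the sequence of Solr commands for an incremental run:
    an add + commit per batch, and an optimize after every N batches."""
    commands = []
    batches = (total_docs + batch_size - 1) // batch_size  # ceiling division
    for i in range(1, batches + 1):
        commands.append("add")
        commands.append("commit")
        if i % optimize_every == 0:
            commands.append("optimize")
    return commands
```

For example, a 100k-document run produces five commits and a single optimize at the end.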
We made the mistake of not initially testing against the full 13GB index in our test environments, so when the service was deployed to live we found that commits were taking around four minutes, causing the service to time out and mark the updates as failed in our database. Even though the call from SolrNet timed out, the Solr instance continued to process the commit successfully, so we set the optional waitFlush and waitSearcher attributes so that our service wouldn’t have to wait for a response from Solr.
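With both attributes set to false, the commit message returns as soon as Solr has accepted it, rather than after the flush completes and the new searcher is registered:

```xml
<commit waitFlush="false" waitSearcher="false"/>
```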
This caused its own problems: we were now issuing numerous commit commands, which blocked any further adds from being processed. We needed to reduce the time taken to perform a commit.
While Googling for answers I came across Jay Hill’s post, which details some of the most common pitfalls of using Solr. Several of these stem from using the out-of-the-box configuration, which we were certainly guilty of doing.
We started by reducing the autowarmCount on our caches to zero. This setting specifies how many objects from the currently running searcher’s cache should be copied to the new searcher.
<filterCache class="solr.LRUCache" size="256" initialSize="128" autowarmCount="0"/>
We took this decision because we maintain our own cache of search results using memcached. This reduced our commit time, but optimising was still a problem, taking somewhere around 14 minutes. So we decided to forgo some query performance and increase the maximum number of segments our index could have, using the optional maxSegments attribute when calling optimize.
<optimize maxSegments="5" />
We also removed the dismax and partitioned search request handlers, as they were never used, and removed the listeners from the newSearcher and firstSearcher events, since we no longer needed to warm the caches of these searchers.
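Concretely, this meant deleting (or commenting out) listener blocks in solrconfig.xml that looked something like the following; the query shown is a placeholder, not one of our actual warming queries:

```xml
<!-- removed: warming queries run against each new searcher -->
<!--
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">popular query</str></lst>
  </arr>
</listener>
-->
```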
The next step is to upgrade to Solr 1.4 (from 1.3) in a master/slave configuration, so that we can have a dedicated write instance and scale horizontally as we move more towards Solr and away from our RDBMS.
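Solr 1.4 ships an HTTP-based ReplicationHandler for exactly this setup. A minimal sketch of the solrconfig.xml entries, assuming a hypothetical master host and a five-minute poll interval:

```xml
<!-- on the master (the dedicated write instance) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- on each slave (read-only query instances) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

This keeps the expensive commits and optimizes on the master, while the slaves pull completed index snapshots and stay responsive for queries.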