SOLR autoCommit vs autoSoftCommit

Question

I'm very confused about <autoCommit> and <autoSoftCommit>. Here is what I understand:

autoSoftCommit - after an autoSoftCommit, if the SOLR server goes down, the soft-committed documents will be lost.
autoCommit - does a hard commit to disk, makes sure all the soft-committed documents are written to disk, and commits any other documents.

My configuration below seems to work only with autoSoftCommit. autoCommit on its own does not seem to do any commits. Is there something I am missing?

<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoSoftCommit>
        <maxDocs>1000</maxDocs>
        <maxTime>1200000</maxTime>
    </autoSoftCommit>
    <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>120000</maxTime> 
        <openSearcher>false</openSearcher>
    </autoCommit>
</updateHandler>

Why is autoCommit not working on its own?

score 42 · Answer 1 · edited Sep 22 '19 at 22:48

I think this article will be useful for you. It explains in detail how hard commit and soft commit work, and the tradeoffs that should be taken into account when tuning your system.

I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?

We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.

These are places to start.

HEAVY (BULK) INDEXING

The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.

Set your soft commit interval quite long. As in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near real time searching so don’t do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
- Turning off the tlog completely for the bulk-load operation
- Indexing offline with some kind of map-reduce process
- Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
- etc.
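
As a rough sketch, the bulk-indexing starting point above could look like this in solrconfig.xml. The millisecond values are just the 10-minute and 15-second figures converted, and the updateLog block is copied from the question's configuration; treat all of it as a starting point rather than a recommendation for any particular system.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <!-- Soft commit (visibility): 10 minutes = 600000 ms -->
    <autoSoftCommit>
        <maxTime>600000</maxTime>
    </autoSoftCommit>
    <!-- Hard commit (durability): 15 seconds = 15000 ms, no new searcher -->
    <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>
</updateHandler>
```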

INDEX-HEAVY, QUERY-LIGHT

By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.

Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false.
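
For the on-demand commits mentioned above, one option is Solr's XML update message format, posted to the core's update handler. This is a minimal sketch: each element is meant to be sent as its own request body, and it assumes that an explicit commit opens a new searcher by default, so no openSearcher attribute is shown.

```xml
<!-- Request body A: explicit soft commit, makes recently indexed documents visible -->
<commit softCommit="true"/>

<!-- Request body B: explicit hard commit, flushes to disk and (by default) opens a new searcher -->
<commit/>
```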

INDEX-LIGHT, QUERY-LIGHT OR HEAVY

This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update.

Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
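
A sketch of what that might look like inside the <updateHandler> section, assuming the 5-minute end of the range. Leaving out <autoSoftCommit> entirely should have the same effect as the explicit -1 shown here.

```xml
<!-- Hard commit every 5 minutes (300000 ms), opening a new searcher so changes become visible -->
<autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>true</openSearcher>
</autoCommit>
<!-- Soft commits disabled: a maxTime of -1 turns the time trigger off -->
<autoSoftCommit>
    <maxTime>-1</maxTime>
</autoSoftCommit>
```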

INDEX-HEAVY, QUERY-HEAVY

This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start:

Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
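
Since this case calls for experimentation, it can help to make the intervals overridable at startup through property substitution in solrconfig.xml. In this sketch, the 5000 ms soft-commit default is purely a placeholder assumption (the article gives no number), and openSearcher=false is carried over from the other scenarios rather than stated for this one.

```xml
<!-- Soft commit interval: tune this first; 5000 ms is only a placeholder default -->
<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
</autoSoftCommit>
<!-- Hard commit every 15 seconds unless overridden -->
<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>
```

With this in place, a different interval can be tried by starting Solr with a system property such as -Dsolr.autoSoftCommit.maxTime=10000, without editing the file.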

In my case (index heavy, query heavy), master-slave replication was taking too long, slowing down the queries to the slave. By increasing the softCommit to 15 min and the hardCommit to 1 min, the performance improvement was great. Now replication works with no problems, and the servers can handle many more requests per second.

This is my use case, though. I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing items, and new items are available on the slaves every replication cycle (5 min), which is totally OK for my case. You should tune these parameters for your case.
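
For reference, those intervals translate to something like the following in the master's solrconfig.xml. Only the two maxTime values come from the description above. Whether openSearcher was set is not stated, so it is left out here.

```xml
<!-- Soft commit every 15 minutes (900000 ms) -->
<autoSoftCommit>
    <maxTime>900000</maxTime>
</autoSoftCommit>
<!-- Hard commit every 1 minute (60000 ms) -->
<autoCommit>
    <maxTime>60000</maxTime>
</autoCommit>
```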

We don't like link-only answers. Consider posting sufficient information from the link in the answer to make the answer self-contained (not dependent on the link), or posting the link as a comment on the question instead (which you'll be able to do once you get 50 reputation). — Bernhard Barker, Oct 30 '13 at 12:48
The link provided above is really useful if you want to identify which category your application falls into. It will surely help you fine-tune a lot of things and in turn improve performance. — Akshay, May 02 '16 at 22:48
Link seems to be broken. Here is a new one: https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ — alexblum, Oct 19 '16 at 14:18

score 33 · Accepted Answer · answered Jul 16 '13 at 01:24

You have openSearcher=false for hard commits, which means that even though the commit happened, the searcher has not been reopened and cannot see the changes. Try changing that setting and you will not need soft commit.

SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.

This lets you set soft commit to 1 second so documents show up quickly, while hard commits happen less frequently.
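
To illustrate with the configuration from the question, one possible variant keeps only the hard commit and lets it open a new searcher. The values are the asker's own; only openSearcher is changed and the <autoSoftCommit> block is dropped.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>120000</maxTime>
        <!-- true: each hard commit also opens a new searcher, so committed documents become searchable -->
        <openSearcher>true</openSearcher>
    </autoCommit>
</updateHandler>
```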

Alexandre Rafalovitch

That makes sense. I guess openSearcher=true is not really required if the documents have already been soft-committed. I'm indexing 500,000 records every 2 hours. Do you think setting softCommit to 3 minutes and autoCommit to 1 hour is going to be a good configuration for production? – user794783 Jul 16 '13 at 06:34
Are you indexing continuously or in a batch? Remember that soft commits have more memory requirements than hard commits (some extra in-memory structures). Either way the soft vs. hard distinction was for those who needed near-real-time visibility of the documents (seconds). If you are operating in minutes, you can probably just stick with hard commits every couple of minutes and not notice the difference. Test it and if you have further questions, ask on the Solr Users mailing list for more advanced help. – Alexandre Rafalovitch Jul 16 '13 at 12:51
Yes, I'm indexing in batch. I'm adding about 500-700 documents every second. The documents are very small in size. I'm not really worried about indexing instantly. But I need it to be indexed at least every 30 minutes. So I will just use autoCommit with openSearcher=true? – user794783 Jul 16 '13 at 14:47
That's a high throughput AFAIK. You really need to check the mailing list for past discussion of performance, memory and related issues. It is separate from this SO question on the meaning of commits. – Alexandre Rafalovitch Jul 16 '13 at 15:06
What are the tradeoffs of setting openSearcher to true? – freedrull Oct 05 '15 at 03:09
@AlexandreRafalovitch can you please check this: https://stackoverflow.com/questions/67506494/solr-cannot-write-to-config-directory-switching-to-in-memory-storage-instead – siva sandeep May 12 '21 at 15:24

nagendra patod · Answer 3 · 2017-12-22T20:09:59.037

Soft commits are about visibility, hard commits are about durability, and optimize is about performance.

Soft commits are very fast. Their changes are visible, but they are not persisted (they exist only in memory), so after a crash these changes might be lost.
Hard commits persist changes to disk.
Optimize is like a hard commit, but it also merges the Solr index segments into a single segment to improve performance. It is very costly.

A commit (hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have been flushed to stable storage and no data loss will result from a power failure.

A soft commit is much faster since it only makes index changes visible and does not fsync index files or write a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less expensive" in terms of time, but not free, since it can slow throughput.

An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.
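
An optimize can also be requested explicitly through an XML update message. A minimal sketch, assuming the message is posted to the core's update handler and that merging down to one segment is really what is wanted:

```xml
<!-- Force-merge the index down to a single segment; rewrites the whole index, so use sparingly -->
<optimize maxSegments="1"/>
```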

Auto-commit properties can be managed in the solrconfig.xml file.

    <autoCommit>
        <maxTime>1000</maxTime>
    </autoCommit>

    <!-- SoftAutoCommit

         Perform a 'soft' commit automatically under certain conditions.
         This commit avoids ensuring that data is synched to disk.

         maxDocs - Maximum number of documents to add since the last
                   soft commit before automatically triggering a new soft commit.

         maxTime - Maximum amount of time in ms that is allowed to pass
                   since a document was added before automatically
                   triggering a new soft commit.
      -->

    <autoSoftCommit>
        <maxTime>1000</maxTime>
    </autoSoftCommit>

References:

https://wiki.apache.org/solr/SolrConfigXml

https://lucene.apache.org/solr/guide/6_6/index.html

Hi @MatsLindh, I tried to answer using the Apache Solr reference guide. I also updated the answer to clarify the differences between soft commit, hard commit, and optimize in Solr. Hope you like it. — nagendra patod, Dec 22 '17 at 20:02
