
Say I update my index once a day, every day, at the same time. During the time between updates (21 hours or so), will the docids remain constant?

PSK
  • Not an answer, just a pointer, in case you have not already seen it: The only place I have seen a discussion of DocIDs is in the [index package documentation](https://lucene.apache.org/__root/docs.lucene.apache.org/content/core/8_8_0/core/org/apache/lucene/index/package-summary.html#segments), where segments and DocIDs are discussed. It concludes: _"docid values must always be treated as internal implementation, not exposed as part of an application, nor stored or referenced outside of Lucene's internal APIs"_. – andrewJames Feb 01 '21 at 13:37
  • Having said that, outside of segment merges, the IDs do appear to be stable. But when exactly do merges get triggered? Immediately when a segment gets large enough? As a background process, the timing of which you cannot predict? – andrewJames Feb 01 '21 at 13:39
  • Thanks @andrewjames. I have read the documentation, and I was hoping someone would have an answer to the question in your second comment. I have posted another [question](https://stackoverflow.com/questions/66001866/how-to-group-results-of-a-lucene-query-count-the-hits-by-group-and-highlight-th) describing what I am trying to achieve. It would be great if you have an answer to that. – PSK Feb 02 '21 at 00:08

1 Answer


As @andrewjames mentioned, the docIds only change when a merge happens. A docId is basically the array index position of the doc in a particular segment.

A side effect of that is that if you have multiple segments, the same segment-local docId can belong to multiple docs, one in one segment, one in another segment, etc. (The docIds you get back from a top-level `IndexSearcher` are offset by each segment's doc base, so those are unique across the whole index.) If the repeated per-segment ids are a problem, you can do a force merge once you are done building your index so that there is only a single segment. Then no two docs will have the same docId at that point.
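To make the segment-local numbering concrete, here is a minimal toy model in plain Java. It is not Lucene code: `SegmentModel`, `globalId`, and `merge` are hypothetical names made up for this sketch, and it assumes that a merge simply copies live docs into one new segment in segment order. Real merge policies may pick only some segments and are not obliged to behave this simply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only, NOT Lucene code. A "segment" is a list of documents and a
// segment-local docId is the position of a document in that list.
class SegmentModel {

    // Index-wide id = number of docs in all earlier segments (the "doc base")
    // plus the segment-local id.
    static int globalId(List<List<String>> segments, int segmentIndex, int localId) {
        int docBase = 0;
        for (int i = 0; i < segmentIndex; i++) {
            docBase += segments.get(i).size();
        }
        return docBase + localId;
    }

    // Assumed merge behaviour: copy every live doc into one new segment, in
    // segment order, skipping deleted docs. Positions are assigned afresh,
    // which is why ids can change after a merge.
    static List<String> merge(List<List<String>> segments, Set<String> deleted) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (!deleted.contains(doc)) {
                    merged.add(doc);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> segments =
                List.of(List.of("a", "b"), List.of("c", "d", "e"));

        // "a" and "c" share the same segment-local id in their own segments.
        System.out.println("local id of a: " + segments.get(0).indexOf("a"));
        System.out.println("local id of c: " + segments.get(1).indexOf("c"));

        // Index-wide id of "c" before the merge.
        System.out.println("global id of c: " + globalId(segments, 1, 0));

        // After merging (with "b" deleted), "c" sits at a different position.
        List<String> merged = merge(segments, Set.of("b"));
        System.out.println("id of c after merge: " + merged.indexOf("c"));
    }
}
```

In this model nothing moves as long as `merge` is never called, which mirrors the point that ids stay put between merges.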

The docId for a given document will not change unless a merge happens. And a merge won't happen unless you call force merge, add or delete documents, or upgrade your index.

So... if you build your index and don't add docs, delete docs, call force merge, or upgrade your index, then the docIds will be stable. But the next time you build your index, a given doc may receive a totally different docId. And as @andrewjames said, the docId assignments and the timing of those assignments are an internal affair in Lucene, so you should be cautious about relying on them even when you know when and how they are currently assigned.
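One way to avoid depending on docIds across builds is to carry your own key with each document and resolve it to whatever docId the current build assigned. The sketch below is again a plain-Java toy model, not Lucene code; `StableKeyLookup`, `keyToDocId`, and the `invoice-…` keys are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, NOT Lucene code: each "build" assigns docIds by position, so the
// same document can land on a different docId after a rebuild.
class StableKeyLookup {

    // Maps an application-level key (hypothetical, e.g. a database primary
    // key) to the docId that the current build happened to assign.
    static Map<String, Integer> keyToDocId(List<String> keysInIndexOrder) {
        Map<String, Integer> lookup = new HashMap<>();
        for (int docId = 0; docId < keysInIndexOrder.size(); docId++) {
            lookup.put(keysInIndexOrder.get(docId), docId);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Two daily builds; the second one contains an extra document.
        Map<String, Integer> monday =
                keyToDocId(List.of("invoice-7", "invoice-9"));
        Map<String, Integer> tuesday =
                keyToDocId(List.of("invoice-3", "invoice-7", "invoice-9"));

        // The key stays the same, while the docId behind it differs per build.
        System.out.println("Monday docId:  " + monday.get("invoice-7"));
        System.out.println("Tuesday docId: " + tuesday.get("invoice-7"));
    }
}
```

Anything stored outside the index would then reference the key, and the docId would only be looked up at the moment it is needed.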

RonC
  • That's helpful, thanks! I've decided to work around using docids. Can you please tell me what upgrading an index means? Are you referring to the version of Lucene? Also if possible could you please provide a link to an example or documentation of force merge? – PSK Feb 04 '21 at 08:52
  • Yep, by "upgrading the index" I mean using the index with a version of Lucene that is at least one major version newer than the version used to create the index. I googled around but didn't see any good code examples of force merge, but it's basically just a method on the `IndexWriter` class. The source code is here: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2089 and you just call it like `indexWriter.forceMerge(maxNumSegments, doWait);` – RonC Feb 04 '21 at 13:50