I am using Lucene to store (as well as index) various documents.

Each document needs a persistent unique identifier (to be used as part of a URL).

If I were using a SQL database, I could use an auto-incrementing integer primary key (or similar) to automatically generate a unique id for every record that was added.

Is there any way of doing this with Lucene?

I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time (for example, after deleted documents are merged away), so they cannot serve as persistent identifiers.

(I'm using the Java version of Lucene 3.0.3.)

– dave4420

4 Answers

As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer so the whole id is treated as a single token. You can keep a counter in memory and update it for each new document.
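
A minimal sketch of such a field in Lucene 3.x (the field name "uid" is just an assumed example; `Field.Index.NOT_ANALYZED` indexes the value as a single token, which has the same effect as a KeywordAnalyzer):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class UidField {
    /** Attach a persistent id to a document before calling writer.addDocument(doc).
     *  Stored, so it can be read back; NOT_ANALYZED, so it is indexed as one token. */
    public static void addUid(Document doc, long nextId) {
        doc.add(new Field("uid", String.valueOf(nextId),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}
```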

What remains is the problem of persistence: how to save the maximal id when the Lucene process stops. One possibility is a small text file that records the maximal id (see the sketch below).
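
A sketch of that idea, assuming a single process hands out ids (the class and file layout are my own, not part of the answer):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

/** Sketch: an id counter persisted to a text file across restarts. */
public class PersistentCounter {
    private final File file;
    private long lastId;

    public PersistentCounter(File file) throws IOException {
        this.file = file;
        if (file.exists()) {
            Scanner s = new Scanner(file);
            lastId = s.nextLong(); // the maximal id saved so far
            s.close();
        }
    }

    public synchronized long nextId() throws IOException {
        lastId++;
        FileWriter w = new FileWriter(file); // rewrite the whole file
        w.write(Long.toString(lastId));
        w.close();
        return lastId;
    }
}
```

Writing the file on every id is safe but slow; saving only on shutdown would be faster, at the cost of possibly reusing ids after a crash.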

I believe Flexible Indexing will allow you to add the maximal id to the index itself as a "global" field. If you are willing to work with Lucene's trunk, you can try it and see whether it fits the bill.

– Yuval F

For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway):

  • Create a new AtomicLong, initialized with a value obtained from System.currentTimeMillis() or System.nanoTime().
  • Generate each subsequent ID by calling .incrementAndGet() or .getAndIncrement() on that AtomicLong.
  • If the system is restarted, the AtomicLong is re-initialized to the current timestamp during startup (see the sketch below).
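
A minimal sketch of the generator (plain Java; the class name is mine):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the timestamp-seeded id generator described above. */
public class TimestampIdGenerator {
    // Seeded from the clock, so after a restart the counter starts ahead of
    // all previously issued ids, provided ids were generated at less than
    // one per millisecond on average.
    private final AtomicLong counter = new AtomicLong(System.currentTimeMillis());

    /** Thread-safe and non-blocking. */
    public long nextId() {
        return counter.incrementAndGet();
    }
}
```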

Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, layer a hi/lo algorithm on top of the existing long, or sacrifice some of the high bytes.

Cons: it does not work if new entities are added more often than once per millisecond (for System.currentTimeMillis()) or once per nanosecond (for System.nanoTime()), since after a restart the fresh timestamp could lag behind ids that were already issued. It also does not tolerate clock abnormalities, such as the clock being set backwards.

You can also consider a UUID as yet another alternative; the probability of a duplicate UUID is virtually non-existent.

– mindas

Try to find a unique value in the data source you are indexing, and store it in the Lucene document. A data source could be a MySQL database, files from a file system, etc.

For example, if you are indexing content from a MySQL database, you can assemble a unique id from the table name and the primary key: "tablename_rowID".

Let's say you are indexing two tables, 'pages' and 'comments'. For the row with id 28 in the pages table, you can generate the unique id "page_28"; similarly, for row 36 in the comments table, the unique id would be "comment_36" (see the sketch below).
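
As a sketch (table names and row ids taken from the example above):

```java
/** Sketch: assemble the composite "tablename_rowID" ids described above. */
public class CompositeId {
    public static String uid(String table, long rowId) {
        return table + "_" + rowId;
    }

    public static void main(String[] args) {
        System.out.println(uid("page", 28));    // page_28
        System.out.println(uid("comment", 36)); // comment_36
    }
}
```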

If all options fail, then I would stick to a UUID. With some additional paranoia, this could be a UUID appended to the current timestamp.
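
A sketch of that fallback (joining the two with an underscore is just one assumed format):

```java
import java.util.UUID;

public class UuidId {
    /** Sketch: a random UUID plus the current timestamp, for extra paranoia. */
    public static String next() {
        return UUID.randomUUID() + "_" + System.currentTimeMillis();
    }
}
```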

– Basil Musa

EDIT: Several commenters have raised possible issues with this approach and I don't have time to test it thoroughly. I'm leaving it here because Yuval F. refers to it. Please don't downvote unnecessarily.

Given an IndexWriter w, you can use w.maxDoc() + 1 as an id and store that (as a string) in a separate Field. Make sure the Field is stored.
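
A sketch of what this answer describes; note the caveats raised in the comments below before relying on it (the field name "id" is just an example):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MaxDocId {
    /** Sketch of this (disputed) approach: derive the id from maxDoc().
     *  The comments below describe cases where ids could be reused. */
    public static void addWithId(IndexWriter w, Document doc) throws IOException {
        String id = Integer.toString(w.maxDoc() + 1);
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        w.addDocument(doc);
    }
}
```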

– Fred Foo
  • Why store the id field without indexing it? Doesn't that mean I cannot search by id? – dave4420 Feb 20 '11 at 18:57
  • Also, wouldn't this be affected by merging, and reuse ids when deleted documents are pruned? – sisve Feb 20 '11 at 22:39
  • Excuse me @Dave, misread your question. Of course you can index it if you want. @Simon Svensson: the API docs state "Returns total number of docs in this index (...) not counting deletions". – Fred Foo Feb 21 '11 at 10:14
  • I don't think this will work. Suppose there are n docs. Add n+1. Delete one. Add another. Now you have two docs with ID n+1. (You'd also get really boned if you merged indexes etc.) – Xodarap Feb 21 '11 at 16:06
  • @Xodarap: If I read the API doc correctly ("not counting deletions") then this approach does guard against that. In fact, that seems to be why `IndexWriter` has both `maxDoc` and `numDocs` methods. – Fred Foo Feb 21 '11 at 16:35
  • @larsmans: As Pascal mentions, once you optimize, the segment info no longer contains the count of deleted docs. You can try it with Luke: delete, optimize, and then see that the count doesn't include your deletes. – Xodarap Feb 21 '11 at 17:19