I am using Lucene to store (as well as index) various documents.

Each document needs a persistent unique identifier (to be used as part of a URL).

If I were using a SQL database, I could use an auto-incrementing integer primary key (or similar) to automatically generate a unique id for every record that was added.

Is there any way of doing this with Lucene?

I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time (for example, after deleted documents are merged away), so they cannot serve as persistent identifiers.

(I'm using the Java version of Lucene 3.0.3.)

– dave4420

4 Answers

As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer so the whole id is treated as a single token. You can keep a counter in memory and update it for each new document.
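
A minimal sketch of such a field in Lucene 3.x (the field name "uid" is just an assumed example; `Field.Index.NOT_ANALYZED` indexes the value as a single token, which has the same effect as a KeywordAnalyzer):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class UidField {
    /** Attach a persistent id to a document before calling writer.addDocument(doc).
     *  Stored, so it can be read back; NOT_ANALYZED, so it is indexed as one token. */
    public static void addUid(Document doc, long nextId) {
        doc.add(new Field("uid", String.valueOf(nextId),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}
```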

What remains is the problem of persistence: how to save the maximal id when the Lucene process stops. One possibility is a small text file that records the maximal id (see the sketch below).
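
A sketch of that idea, assuming a single process hands out ids (the class and file layout are my own, not part of the answer):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

/** Sketch: an id counter persisted to a text file across restarts. */
public class PersistentCounter {
    private final File file;
    private long lastId;

    public PersistentCounter(File file) throws IOException {
        this.file = file;
        if (file.exists()) {
            Scanner s = new Scanner(file);
            lastId = s.nextLong(); // the maximal id saved so far
            s.close();
        }
    }

    public synchronized long nextId() throws IOException {
        lastId++;
        FileWriter w = new FileWriter(file); // rewrite the whole file
        w.write(Long.toString(lastId));
        w.close();
        return lastId;
    }
}
```

Writing the file on every id is safe but slow; saving only on shutdown would be faster, at the cost of possibly reusing ids after a crash.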

I believe Flexible Indexing will allow you to add the maximal id to the index itself as a "global" field. If you are willing to work with Lucene's trunk, you can try it and see whether it fits the bill.

– Yuval F

For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway):

  • Create a new AtomicLong, initialized with a value obtained from System.currentTimeMillis() or System.nanoTime().
  • Generate each subsequent ID by calling .incrementAndGet() or .getAndIncrement() on that AtomicLong.
  • If the system is restarted, the AtomicLong is re-initialized to the current timestamp during startup (see the sketch below).
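
A minimal sketch of the generator (plain Java; the class name is mine):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the timestamp-seeded id generator described above. */
public class TimestampIdGenerator {
    // Seeded from the clock, so after a restart the counter starts ahead of
    // all previously issued ids, provided ids were generated at less than
    // one per millisecond on average.
    private final AtomicLong counter = new AtomicLong(System.currentTimeMillis());

    /** Thread-safe and non-blocking. */
    public long nextId() {
        return counter.incrementAndGet();
    }
}
```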

Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, layer a hi/lo algorithm on top of the existing long, or sacrifice some of the high bytes.

Cons: it does not work if new entities are added more often than once per millisecond (for System.currentTimeMillis()) or once per nanosecond (for System.nanoTime()), since after a restart the fresh timestamp could lag behind ids that were already issued. It also does not tolerate clock abnormalities, such as the clock being set backwards.

You can also consider a UUID as yet another alternative; the probability of a duplicate UUID is virtually non-existent.

– mindas

Try to find a unique value in the data source you are indexing, and store it in the Lucene document. A data source could be a MySQL database, files from a file system, etc.

For example, if you are indexing content from a MySQL database, you can assemble a unique id from the table name and the primary key: "tablename_rowID".

Let's say you are indexing two tables, 'pages' and 'comments'. For the row with id 28 in the pages table, you can generate the unique id "page_28"; similarly, for row 36 in the comments table, the unique id would be "comment_36" (see the sketch below).
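
As a sketch (table names and row ids taken from the example above):

```java
/** Sketch: assemble the composite "tablename_rowID" ids described above. */
public class CompositeId {
    public static String uid(String table, long rowId) {
        return table + "_" + rowId;
    }

    public static void main(String[] args) {
        System.out.println(uid("page", 28));    // page_28
        System.out.println(uid("comment", 36)); // comment_36
    }
}
```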

If all options fail, then I would stick to a UUID. With some additional paranoia, this could be a UUID appended to the current timestamp.
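
A sketch of that fallback (joining the two with an underscore is just one assumed format):

```java
import java.util.UUID;

public class UuidId {
    /** Sketch: a random UUID plus the current timestamp, for extra paranoia. */
    public static String next() {
        return UUID.randomUUID() + "_" + System.currentTimeMillis();
    }
}
```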

– Basil Musa

EDIT: Several commenters have raised possible issues with this approach and I don't have time to test it thoroughly. I'm leaving it here because Yuval F. refers to it. Please don't downvote unnecessarily.

Given an IndexWriter w, you can use w.maxDoc() + 1 as an id and store that (as a string) in a separate Field. Make sure the Field is stored.
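
A sketch of what this answer describes; note the caveats raised in the comments below before relying on it (the field name "id" is just an example):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MaxDocId {
    /** Sketch of this (disputed) approach: derive the id from maxDoc().
     *  The comments below describe cases where ids could be reused. */
    public static void addWithId(IndexWriter w, Document doc) throws IOException {
        String id = Integer.toString(w.maxDoc() + 1);
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        w.addDocument(doc);
    }
}
```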

– Fred Foo
  • Why store the id field without indexing it? Doesn't that mean I cannot search by id? – dave4420 Feb 20 '11 at 18:57
  • Also, wouldn't this be affected by merging, and reuse ids when deleted documents are pruned? – sisve Feb 20 '11 at 22:39
  • Excuse me @Dave, misread your question. Of course you can index it if you want. @Simon Svensson: the API docs state "Returns total number of docs in this index (...) not counting deletions". – Fred Foo Feb 21 '11 at 10:14
  • I don't think this will work. Suppose there are n docs. Add n+1. Delete one. Add another. Now you have two docs with ID n+1. (You'd also get really boned if you merged indexes etc.) – Xodarap Feb 21 '11 at 16:06
  • @Xodarap: If I read the API doc correctly ("not counting deletions") then this approach does guard against that. In fact, that seems to be why `IndexWriter` has both `maxDoc` and `numDocs` methods. – Fred Foo Feb 21 '11 at 16:35
  • @larsmans: As Pascal mentions, once you optimize, the segment info no longer contains the count of deleted docs. You can try it with Luke: delete, optimize, and then see that the count doesn't include your deletes. – Xodarap Feb 21 '11 at 17:19