How to re-index documents with integer id?

Question

I have JSON documents that represent database rows.

{"id":3121,"name":"Nikon AF-S DX Nikkor 35 mm", "brand": "Nikon", "price": 456.32}

{"id":3122,"name":"Canon EF-S 55-250 mm", "brand": "Canon", "price": 500.98}

I am trying to index these documents with Lucene.NET. I have created a class that represents these JSON entries.

[<CLIMutable>]
type Lens =
  { Id: int; Name: string; Brand: string; Price: float }

I can open the JSON entry, convert it to a Lens object, and then create a Lucene Document.

  let getDocument (inputDocument:Lens) =
    let id = StoredField("id", inputDocument.Id)
    let name  = TextField("name", inputDocument.Name, Field.Store.YES)
    let brand  = StoredField("brand", inputDocument.Brand)
    let price  = StoredField("price", inputDocument.Price)
    let doc = Document()
    doc.Add(id)
    doc.Add(name)
    doc.Add(brand)
    doc.Add(price)
    // return
    doc

This works and I can add the document to the index and search it. The problem starts when I would like to update a document in the index instead of adding it.

let upsertDocument (writer:IndexWriter) (doc:Document) =
      try
        let id = doc.GetField("id").GetStringValue()
        let term = Term("id", id)
        writer.UpdateDocument(term, doc)
        writer.Flush(triggerMerge = false, applyAllDeletes = false)
        Ok "Ok"
      with ex ->
        logger
        <| sprintf "Exception : %s" ex.Message
        logger
        <| sprintf "Exception : %A" ex.StackTrace
        Error ex.Message

The documentation says that I have to create a Term with the id field and use the .GetStringValue() for the term and call .UpdateDocument(term, doc). However, this does not work and a new document is getting added every time I call upsertDocument.

What is the best field type for an int32 database id and how can I use that in an index update operation to overwrite the previous entry?

The complete workflow in a single file:

Code is better as a gist:

https://gist.github.com/l1x/91c36b867acc70e8486a6bce7899332a

Update0: Funnily enough, the standalone code does not reproduce the bug.

Update1: I can reliably reproduce the bug. The key thing is calling IndexWriter.Commit(). The duplication is visible by inspecting the documents.

Searching for Nikon or Canon results in many documents. These all have the same id.

> searcherz.Search(query, 20).ScoreDocs;;
val it : ScoreDoc [] =
  [|doc=5 score=1.1118877 shardIndex=-1 {Doc = 5;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=6 score=1.1118877 shardIndex=-1 {Doc = 6;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=7 score=1.1118877 shardIndex=-1 {Doc = 7;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=8 score=1.1118877 shardIndex=-1 {Doc = 8;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;}|]

Duplicates:

> let hits = searcherz.Search(query, 20).ScoreDocs;;
val hits : ScoreDoc [] =
  [|doc=5 score=1.1118877 shardIndex=-1; doc=6 score=1.1118877 shardIndex=-1;
    doc=7 score=1.1118877 shardIndex=-1; doc=8 score=1.1118877 shardIndex=-1|]

> hits |> Seq.map (fun hit -> searcherz.Doc(hit.Doc)) |> Seq.map Seq.head;;
val it : seq<IIndexableField> =
  seq
    [stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;}]

I am not sure if this is a bug or a feature.

Update2: Code is uploaded to gist.

I'm confused by your final sentence. The full code you pasted works as you expect? If so, that would seem to solve your problem, right? — Brian Berns, May 08 '21 at 13:57
Update: I think I've reproduced what you're seeing. I don't have any experience with Lucene, but I'll see what I can figure out. — Brian Berns, May 08 '21 at 16:02
It is really weird. The code I have pasted is exactly what we have in production. The production code produces duplicates. The pasted code on my end did not produce duplicates. I need to investigate it further. — Istvan, May 08 '21 at 19:38
Ok, my understanding of the IndexWriter is probably completely off. I am not sure how to have a writer that just writes data to disk and a reader that reads from disk. The writer certainly duplicates data, even when using UpdateDocument with a term. The reader does not see the disk content if I create it using the writer. — Istvan, May 08 '21 at 20:18
Thanks @brianberns. I do not have too much experience with Lucene. I think I should try to figure out what is the consistency model and how IndexWriter is supposed to be used. — Istvan, May 08 '21 at 20:25
Instead of using the ID as an int, I recommend you try making the ID a string with leading 0's and indexing that. Then when you do the update, it will probably find the original and the new one will replace it. So, for example, instead of ID = 1, make ID = "000001". — RonC, May 08 '21 at 23:19
Also, you have another issue: you have the ID as a stored field, and stored fields are only stored, not indexed. Therefore any search on the field won't return the record, so neither will the update. But if you make the ID a 0-padded string and add it to the doc with a `StringField`, it will get indexed and should work. One parameter of `StringField` controls whether the value is also stored, which you can enable so that the ID is present when you read the document back. — RonC, May 08 '21 at 23:23

Brian Berns · Accepted Answer · 2021-05-09T16:17:28.600

OK, I think I figured this out. The problem is that you can't create a valid Term by using GetStringValue to convert the integer ID to a string (e.g. "3122"). Instead you have to create the term from the ID's prefix-coded raw bytes (e.g. [60 8 0 0 18 31] in hex, which is the encoding of 3121), like this:

open Lucene.Net.Util

let id = doc.GetField("id").GetInt32Value().Value
let bytes = BytesRef(NumericUtils.BUF_SIZE_INT32)
NumericUtils.Int32ToPrefixCodedBytes(id, 0, bytes)
let term = Term("id", bytes)

After making this change, I no longer see duplicate documents in the index. See this SO question for more info.
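
For comparison, the string-key approach suggested in the comments could look roughly like the sketch below. It assumes Lucene.NET 4.8 and is not taken from the gist; the names idKey, getDocumentWithKey and upsertByKey are hypothetical. The id is indexed as a single untokenized term with StringField, so a plain string Term matches it exactly. Zero padding should only matter if you also want lexicographic ordering or range queries on the key; an exact-match update does not need it.

```fsharp
open Lucene.Net.Documents
open Lucene.Net.Index

// Hypothetical helper: the exact string used both when indexing and when looking up.
let idKey (id: int) = string id

// Hypothetical variant of getDocument: the id is indexed as one untokenized term
// and also stored, so it can be read back from search results.
let getDocumentWithKey (id: int) (name: string) =
    let doc = Document()
    doc.Add(StringField("id", idKey id, Field.Store.YES))
    doc.Add(TextField("name", name, Field.Store.YES))
    doc

// Deletes any document whose "id" term equals the key, then adds the new document.
let upsertByKey (writer: IndexWriter) (id: int) (doc: Document) =
    writer.UpdateDocument(Term("id", idKey id), doc)
```

With this layout, calling upsertByKey writer 3122 (getDocumentWithKey 3122 "Canon EF-S 55-250 mm") repeatedly should leave a single live document for that id once the changes are committed and a new reader is opened.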
