I have JSON documents that represent database rows.
{"id":3121,"name":"Nikon AF-S DX Nikkor 35 mm", "brand": "Nikon", "price": 456.32}
{"id":3122,"name":"Canon EF-S 55-250 mm", "brand": "Canon", "price": 500.98}
I am trying to index these documents with Lucene.NET. I have created a class that represents these JSON entries.
[<CLIMutable>]
type Lens =
    { Id: int; Name: string; Brand: string; Price: float }
I can open the JSON entry, convert it to a Lens object, and then create a Lucene Document.
let getDocument (inputDocument:Lens) =
    let id = StoredField("id", inputDocument.Id)
    let name = TextField("name", inputDocument.Name, Field.Store.YES)
    let brand = StoredField("brand", inputDocument.Brand)
    let price = StoredField("price", inputDocument.Price)
    let doc = Document()
    doc.Add(id)
    doc.Add(name)
    doc.Add(brand)
    doc.Add(price)
    // return
    doc
This works and I can add the document to the index and search it. The problem starts when I would like to update a document in the index instead of adding it.
let upsertDocument (writer:IndexWriter) (doc:Document) =
    try
        let id = doc.GetField("id").GetStringValue()
        let term = Term("id", id)
        writer.UpdateDocument(term, doc)
        writer.Flush(triggerMerge = false, applyAllDeletes = false)
        Ok "Ok"
    with ex ->
        logger
        <| sprintf "Exception : %s" ex.Message
        logger
        <| sprintf "Exception : %A" ex.StackTrace
        Error ex.Message
According to the documentation, I have to create a Term from the id field (using .GetStringValue() for its value) and then call .UpdateDocument(term, doc). However, this does not work: a new document is added every time I call upsertDocument.
What is the best field type for an int32 database id and how can I use that in an index update operation to overwrite the previous entry?
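For reference, here is a minimal sketch of one possible approach. It rests on an assumption rather than a confirmed diagnosis: a StoredField is stored but not indexed, so Term("id", ...) has nothing to match and UpdateDocument deletes nothing before adding. The sketch indexes the id as a single untokenized term with StringField instead. The names getDocumentWithIndexedId and upsertById are hypothetical and are not part of the gist.

```fsharp
open Lucene.Net.Documents
open Lucene.Net.Index

[<CLIMutable>]
type Lens =
    { Id: int; Name: string; Brand: string; Price: float }

// Hypothetical variant of getDocument: the id is indexed, not just stored.
let getDocumentWithIndexedId (inputDocument: Lens) =
    let doc = Document()
    // StringField indexes the value verbatim as one term and also stores it,
    // so a Term("id", "3122") can match this document later.
    doc.Add(StringField("id", string inputDocument.Id, Field.Store.YES))
    doc.Add(TextField("name", inputDocument.Name, Field.Store.YES))
    doc.Add(StoredField("brand", inputDocument.Brand))
    doc.Add(StoredField("price", inputDocument.Price))
    doc

// Hypothetical upsert: delete every document whose indexed id matches, then add doc.
let upsertById (writer: IndexWriter) (doc: Document) =
    let term = Term("id", doc.Get("id"))
    writer.UpdateDocument(term, doc)
```

Converting the int32 id to a string keeps the delete-by-term lookup an exact match. Whether this is the best field type for a numeric id is still the open question here.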
The complete workflow is in a single file. The code is easier to read as a gist:
https://gist.github.com/l1x/91c36b867acc70e8486a6bce7899332a
Update0: It is kind of funny. It does not reproduce the bug.
Update1: I can reliably reproduce the bug. The key thing is calling IndexWriter.Commit(). The duplication is visible when inspecting the documents.
Searching for Nikon or Canon results in many documents. These all have the same id.
> searcherz.Search(query, 20).ScoreDocs;;
val it : ScoreDoc [] =
  [|doc=5 score=1.1118877 shardIndex=-1 {Doc = 5;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=6 score=1.1118877 shardIndex=-1 {Doc = 6;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=7 score=1.1118877 shardIndex=-1 {Doc = 7;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;};
    doc=8 score=1.1118877 shardIndex=-1 {Doc = 8;
                                         Score = 1.111887693f;
                                         ShardIndex = -1;}|]
Duplicates:
> let hits = searcherz.Search(query, 20).ScoreDocs;;
val hits : ScoreDoc [] =
  [|doc=5 score=1.1118877 shardIndex=-1; doc=6 score=1.1118877 shardIndex=-1;
    doc=7 score=1.1118877 shardIndex=-1; doc=8 score=1.1118877 shardIndex=-1|]
> hits |> Seq.map (fun hit -> searcherz.Doc(hit.Doc)) |> Seq.map Seq.head;;
val it : seq<IIndexableField> =
  seq
    [stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;};
     stored<id:3122> {Boost = 1.0f;
                      FieldType = stored;
                      IndexableFieldType = stored;
                      Name = "id";
                      NumericType = INT32;}]
I am not sure if this is a bug or a feature.
Update2: Code is uploaded to gist.