Sort Index using DocValues for integers?

Question

I am using Lucene for an autocomplete mechanism of a text field supporting multiple languages and multiple groups of options. Each group has about 2k to 5k different values.

Currently I query all hits and sort them by an integer value by hand. Since this is inefficient, I need to create an index using doc values. I understand the theory but I can't find a good code snippet to make it work. I bought and read two books, and the topic is either not covered or poorly covered (one small section with one line of code).

My goal is to index an integer value per document and sort in descending order.

Also, am I missing a major documentation source? The Lucene documentation is neither comprehensive nor accessible. I used to use Lucene in Action, but this book is a decade old and the recent changes to the Lucene API are quite dramatic.

As an example:

{name:"A1", number:1000}
{name:"A2", number:1001}
{name:"A3", number:990}
{name:"B1", number:300}

= Query: A* + sorted by number + top2 => A3, A1

Summary: I currently fetch all the documents and do the sorting and trimming (limit) in code, and would rather have Lucene do it.

The implementation uses Java. Since I use only a small set of information but in multiple languages, I create an index using RAMDirectory (yes, I know it's deprecated, but it works) and add each document to a standard index writer using a standard analyzer.

As far as I understand the requirements, I need to define and use a field stored in a column to allow sorting with Lucene. I tried for multiple hours and gave up, because fetching all the information, looking up the data in memory, and sorting and trimming it did the trick, but it is unsatisfactory.

So all that is needed is to add an integer field to the index that allows sorting in Lucene.

Can you be a bit more specific about what you'd like the code example to demonstrate? — RonC, Mar 22 '21 at 13:32
Which language/library are you using ? How do you index/query documents for now ? — EricLavault, Mar 23 '21 at 11:55
Okay thinking that lucene is a java library but I see how one can use different languages/implementations. I will add more details. — Martin Kersten, Mar 23 '21 at 12:16
Thanks Stackoverflow to not being allowed to edit comments.. NewVersion: Well lucene is a java library but I see how one can use different languages/implementations and also missing the Java tag... . I added more details. Thanks for the hint! — Martin Kersten, Mar 23 '21 at 12:22
@MartinKersten StackOverflow only allowed editing a comment for a few minutes after it's made. There is a pretty short time limit. Thanks for the updated post. — RonC, Mar 23 '21 at 12:36
@RonC It told me I had 10 seconds when I wanted to update the content after 4min. Since noone else answered, I think its to rigid. — Martin Kersten, Mar 23 '21 at 23:17

RonC · Accepted Answer · 2021-04-20T19:18:48.707

You are right, documentation for Lucene can be a bit challenging because the last revision of the book Lucene in Action was for version 3.0 and there were very major changes made in Lucene 4.0. I have found one book that covers Lucene 4, called Lucene 4 Cookbook, but it's not too thick and its coverage of sorting is limited to a single page, though it does provide an example.

One great source for learning about Lucene is the unit tests stored with the project. This is where I found the example below. It shows how to store your number as a NumericDocValuesField and then sort by it. Unit tests are not typically suitable for cut-and-paste use in an app, but they do a good job of showing how a feature is used. For example, this unit test uses a RandomIndexWriter, whereas you'd use an IndexWriter.

This sorting approach leverages DocValues. One thing to remember about DocValues is that they are not stored with the document but rather are stored together, per DocValues field. This is what makes them especially suitable for sorting. But when you read back the document, the value won't be one of its fields unless you also stored it as a field in the document. This is why the example stores the value twice, once as a NumericDocValuesField and once as a StringField.

  
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

 /** Tests sorting on type int */
  public void testInt() throws IOException {
    Directory dir = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random(), dir);

    Document doc = new Document();
    doc.add(new NumericDocValuesField("value", 300000));
    doc.add(newStringField("value", "300000", Field.Store.YES));
    writer.addDocument(doc);
    
    doc = new Document();
    doc.add(new NumericDocValuesField("value", -1));
    doc.add(newStringField("value", "-1", Field.Store.YES));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new NumericDocValuesField("value", 4));
    doc.add(newStringField("value", "4", Field.Store.YES));
    writer.addDocument(doc);
    
    IndexReader ir = writer.getReader();
    writer.close();
    
    IndexSearcher searcher = newSearcher(ir);
    Sort sort = new Sort(new SortField("value", SortField.Type.INT));

    TopDocs td = searcher.search(new MatchAllDocsQuery(), 10, sort);
    assertEquals(3, td.totalHits.value);
    // numeric order
    assertEquals("-1", searcher.doc(td.scoreDocs[0].doc).get("value"));
    assertEquals("4", searcher.doc(td.scoreDocs[1].doc).get("value"));
    assertEquals("300000", searcher.doc(td.scoreDocs[2].doc).get("value"));

    ir.close();
    dir.close();
  }

source: Lucene Unit Test on GitHub

Unfortunately I'm a C# developer rather than a Java developer, so it's a little hard for me to write a closer example in Java, since I don't yet have an easy way to test Java Lucene code. But I have provided a C# example below that uses Lucene.NET, which I think you will find very easy to translate to Java.

public void NumericDocValueSort() {

            Analyzer standardAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
            Directory indexDir = new RAMDirectory();
            IndexWriterConfig iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, standardAnalyzer);

            IndexWriter indexWriter = new IndexWriter(indexDir, iwc);

            Document doc = new Document();

            doc.Add(new TextField("name", "A1", Field.Store.YES));
            //doc.Add(new StoredField("number", 1000L));              //uncomment this line to optionally be able to retrieve it from the doc later, can be  done for every doc
            doc.Add(new NumericDocValuesField("number", 1000L));
            indexWriter.AddDocument(doc);

            doc.Fields.Clear();
            doc.Add(new TextField("name", "A2", Field.Store.YES));
            doc.Add(new NumericDocValuesField("number", 1001L));
            indexWriter.AddDocument(doc);

            doc.Fields.Clear();
            doc.Add(new TextField("name", "A3", Field.Store.YES));
            doc.Add(new NumericDocValuesField("number", 990L));
            indexWriter.AddDocument(doc);

            doc.Fields.Clear();
            doc.Add(new TextField("name", "A4", Field.Store.YES));
            doc.Add(new NumericDocValuesField("number", 300L));
            indexWriter.AddDocument(doc);

            indexWriter.Commit();

            IndexReader reader = indexWriter.GetReader(applyAllDeletes: true);
            IndexSearcher searcher = new IndexSearcher(reader);

            Sort sort;
            TopDocs docs;
            SortField sortField = new SortField("number", SortFieldType.INT64);
            sort = new Sort(sortField);

            docs = searcher.Search(new MatchAllDocsQuery(), 1000, sort);
            

            foreach (ScoreDoc scoreDoc in docs.ScoreDocs) {
                Document curDoc = searcher.Doc(scoreDoc.Doc);
                string name = curDoc.Get("name");
            }

            reader.Dispose();               //reader.close() in java
        }

I ran this code on my machine and it returns the docs in the for loop in the proper number order. Note that the reason I use NumericDocValuesField rather than SortedNumericDocValuesField is that the latter is only needed if a single document contains multiple values for the field. Your example did not, so NumericDocValuesField is the one you want in that case.

People are often confused by the word Sorted in the name SortedNumericDocValuesField. In this context it means that if a document contains multiple values for that field, those values are kept in sorted order within the document's field. It has nothing to do with needing the documents in sorted order. Yes, I know, not the best naming approach, and kind of confusing. Anyway, hopefully that solves it for you.
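
For the Java side, a minimal sketch of the same idea might look like the following. It assumes Lucene 8.x with lucene-core on the classpath and has not been checked against other versions. The class name SortByDocValues and the helpers doc and topNames are made up for illustration, ByteBuffersDirectory stands in for the deprecated RAMDirectory, and StringField is used so that the prefix matches the name exactly as written (a TextField run through StandardAnalyzer would lowercase it). The third argument of SortField is the reverse flag; true sorts descending.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name; assumes Lucene 8.x.
public class SortByDocValues {

    // Hypothetical helper: a searchable name plus a sortable number.
    static Document doc(String name, int number) {
        Document d = new Document();
        d.add(new StringField("name", name, Field.Store.YES));  // indexed as one exact term
        d.add(new NumericDocValuesField("number", number));      // column used for sorting
        d.add(new StoredField("number", number));                // optional: lets you read the value back
        return d;
    }

    // Indexes the sample data from the question, then returns the names of the
    // top `limit` documents whose name starts with `prefix`, highest number first.
    static List<String> topNames(String prefix, int limit) throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.addDocument(doc("A1", 1000));
                writer.addDocument(doc("A2", 1001));
                writer.addDocument(doc("A3", 990));
                writer.addDocument(doc("B1", 300));
            } // closing the writer commits the documents

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // reverse = true gives descending order
                Sort sort = new Sort(new SortField("number", SortField.Type.LONG, true));
                TopDocs top = searcher.search(new PrefixQuery(new Term("name", prefix)), limit, sort);

                List<String> names = new ArrayList<>();
                for (ScoreDoc sd : top.scoreDocs) {
                    names.add(searcher.doc(sd.doc).get("name"));
                }
                return names;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topNames("A", 2));
    }
}
```

With the reverse flag set to true, the prefix A with a limit of 2 should give A2 and then A1. Passing false instead sorts ascending, which gives A3 and then A1, as in the question's example.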

Thanks for your effort, RonC. I tried it myself when I developed the service. I got everything working quickly except the column index field sorting. I tried for two hours and gave up. I own both books you mentioned myself. Not much luck. You find a lot from 2015 and before, meaning Lucene 3.0 at most, but today we use Lucene 7 or 8 and they have blown up the API multiple times since. I will test your solution and if I get it working, I will award you the bounty. — Martin Kersten, Mar 23 '21 at 23:23

score 0 · Answer 2 · answered Mar 23 '21 at 12:52

Use a SortedNumericDocValuesField to add the field to your documents.

Use a SortedNumericSortField with the same name in your search query:

Sort sort = new Sort(new SortedNumericSortField("number", SortField.Type.LONG, true));
TopDocs docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
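
A fuller sketch of this variant, including the indexing side, might look like the following. It assumes Lucene 8.x; the class MultiValuedSort, its helpers, and the sample values are hypothetical. Because a document may carry several values in a SortedNumericDocValuesField, the sort needs a selector to pick one value per document. The three-argument constructor above uses the minimum; this sketch picks the maximum instead.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSelector;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name; assumes Lucene 8.x.
public class MultiValuedSort {

    // Hypothetical helper: one document that may carry several numbers.
    static Document doc(String name, long... numbers) {
        Document d = new Document();
        d.add(new StringField("name", name, Field.Store.YES));
        for (long n : numbers) {
            d.add(new SortedNumericDocValuesField("number", n));
        }
        return d;
    }

    // Returns all names, ordered by each document's largest number, descending.
    static List<String> namesByMaxDescending() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.addDocument(doc("A1", 5, 1000));   // max 1000
                writer.addDocument(doc("A2", 1001));      // max 1001
                writer.addDocument(doc("A3", 990, 2000)); // max 2000
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Sort sort = new Sort(new SortedNumericSortField(
                        "number", SortField.Type.LONG, true, SortedNumericSelector.Type.MAX));
                TopDocs top = searcher.search(new MatchAllDocsQuery(), 10, sort);

                List<String> names = new ArrayList<>();
                for (ScoreDoc sd : top.scoreDocs) {
                    names.add(searcher.doc(sd.doc).get("name"));
                }
                return names;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(namesByMaxDescending());
    }
}
```

For the single-valued data in the question, NumericDocValuesField with a plain SortField is enough; this form only matters once a document holds more than one number.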

See this related question: How to sort Numeric field in Lucene 6

TBH I just googled for your use case.

answered Mar 23 '21 at 12:52

chromanoid

I was aware of this but failed to create the index and gave up since I had a working solution, but of course fetching all results instead of using sort + limit is not that elegant. – Martin Kersten Mar 23 '21 at 23:23
