
By default, Lucene returns query results ordered by relevance (score). You can pass one or more sort fields, and the results are then sorted by those fields.

I am now looking for a clean solution that returns the search results in random order.

The bad approach:
Of course I could fetch ALL results and then shuffle the collection, but with 5 million search results, that performs poorly.

The elegant paged approach:
With this approach you would be able to tell Lucene the following:
a) Give me results 1 to 10 out of the 5 million results, in random order.
b) Then give me results 11 to 20 (based on the same random sequence used in a).
c) Just to clarify: calling a) twice returns the same random elements.

How can this approach be implemented?
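
To illustrate, here is a rough sketch of the call pattern I have in mind. This is hypothetical: randomSort(seed) stands for a factory building a Sort on some seeded random comparator (nothing like it ships with Lucene), and searcher/query are assumed to be an open IndexSearcher and a Query:

// Hypothetical sketch: the same seed must yield the same permutation,
// so separate calls page through ONE stable shuffle.
long seed = 12345L;                    // chosen once per search session
Sort sort = randomSort(seed);          // hypothetical factory method

// a) results 1 to 10 of the random order
TopFieldDocs page1 = searcher.search(query, 10, sort);
for (int i = 0; i < page1.scoreDocs.length; i++) {
    System.out.println(page1.scoreDocs[i].doc);
}

// b) results 11 to 20 of the SAME order: request 20 hits and slice,
// which only works because the seed (and thus the order) is unchanged
TopFieldDocs page2 = searcher.search(query, 20, sort);
for (int i = 10; i < Math.min(20, page2.scoreDocs.length); i++) {
    System.out.println(page2.scoreDocs[i].doc);
}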


Update Jul 27, 2012: Be aware that the solution described here for Lucene 2.9.x does not work properly: using the RandomOrderScoreDocComparator results in certain documents appearing twice in the result list.

basZero

3 Answers


You could write a custom FieldComparator:

public class RandomOrderFieldComparator extends FieldComparator<Integer> {

    private final Random random = new Random();

    @Override
    public int compare(int slot1, int slot2) {
        // A fresh random value on every comparison
        return random.nextInt();
    }

    @Override
    public int compareBottom(int doc) throws IOException {
        return random.nextInt();
    }

    @Override
    public void copy(int slot, int doc) throws IOException {
        // No per-slot state is kept
    }

    @Override
    public void setBottom(int bottom) {
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
    }

    @Override
    public Integer value(int slot) {
        return random.nextInt();
    }

}

This doesn't consume any I/O when shuffling the results. Here is a sample program that demonstrates how to use it:

public static void main(String... args) throws Exception {
    RAMDirectory directory = new RAMDirectory();

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

    IndexWriter writer = new IndexWriter(
            directory,
            new IndexWriterConfig(Version.LUCENE_33, analyzer).setOpenMode(OpenMode.CREATE_OR_APPEND)
        );

    Document alice = new Document();
    alice.add( new Field("name", "Alice", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( alice );

    Document bob = new Document();
    bob.add( new Field("name", "Bob", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( bob );

    Document chris = new Document();
    chris.add( new Field("name", "Chris", Field.Store.YES, Field.Index.ANALYZED) );
    writer.addDocument( chris );

    writer.close();

    IndexSearcher searcher = new IndexSearcher( directory );

    for (int pass = 1; pass <= 10; pass++) {
        Query query = new MatchAllDocsQuery();

        Sort sort = new Sort(
                new SortField(
                        "",
                        new FieldComparatorSource() {

                            @Override
                            public FieldComparator<Integer> newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
                                return new RandomOrderFieldComparator();
                            }

                        }
                    )
            );

        TopFieldDocs topFieldDocs = searcher.search( query, 10, sort );

        System.out.print("Pass #" + pass + ":");
        for (int i = 0; i < topFieldDocs.totalHits; i++) {
            System.out.print( " " + topFieldDocs.scoreDocs[i].doc );
        }
        System.out.println();
    }
}

It yields this output (with three documents there are only 3! = 6 possible orders, so repeats across passes are expected):

Pass #1: 1 0 2
Pass #2: 1 0 2
Pass #3: 0 1 2
Pass #4: 0 1 2
Pass #5: 0 1 2
Pass #6: 1 0 2
Pass #7: 0 2 1
Pass #8: 1 2 0
Pass #9: 2 0 1
Pass #10: 0 2 1

Bonus! For those of you trapped in Lucene 2:

public class RandomOrderScoreDocComparator implements ScoreDocComparator {

    private final Random random = new Random();

    public int compare(ScoreDoc i, ScoreDoc j) {
        return random.nextInt();
    }

    public Comparable<?> sortValue(ScoreDoc i) {
        return Integer.valueOf( random.nextInt() );
    }

    public int sortType() {
        return SortField.CUSTOM;
    }

}

All you have to change is the Sort object:

Sort sort = new Sort(
    new SortField(
        "",
        new SortComparatorSource() {
            public ScoreDocComparator newComparator(IndexReader reader, String fieldName) throws IOException {
                return new RandomOrderScoreDocComparator();
            }
        }
    )
);
Adam Paynter
  • Does it mean that the 10 top docs you are retrieving from `searcher.search(query, 10, sort)` are really random out of the 5 Mio docs? I think the method signature does not exist in Lucene 2.9.2? But otherwise, the solution looks very interesting... will investigate on this. THANKS! – basZero Aug 29 '11 at 06:37
  • @basZero: It should shuffle _all_ records, then take the first 10. If you can't get that API to work in 2.9.2, you can take a look at [my previous answer on Lucene sorting](http://stackoverflow.com/questions/817998/how-to-sort-search-results-on-multiple-fields-using-a-weighting-function/825641#825641) (which has sorting code for version 2 and version 3). – Adam Paynter Aug 29 '11 at 08:22
  • Thanks for providing the 2.x version. I will test this within the next few days. I'll update my post accordingly with the results. – basZero Oct 04 '11 at 10:49
  • Hi Adam Paynter: I noticed that if I use your RandomOrderScoreDocComparator, the results get shuffled, but some results appear twice in the final results... Do you know why? – basZero Jul 26 '12 at 13:39
  • @basZero: Hmmm, I am not entirely sure why. I suspect that the algorithm may need to generate *and remember* a random number for each document and compare those random numbers when comparing documents. As it is right now, it could claim that `doc1 > doc2` in one comparison, then change its mind and claim that `doc1 < doc2` in the next comparison. I do not have a lot of time, but perhaps I will take a look at it. – Adam Paynter Nov 06 '12 at 09:58
  • thanks, I will also try to find out more about it and document here if I have something. Looking forward to your results... – basZero Nov 07 '12 at 10:08
  • @Adam Paynter: Just to be clear: I do ONE single search (query) and then apply your Sort approach. Even so, I have seen some results appearing twice in Lucene's results. – basZero Nov 07 '12 at 10:10
  • @basZero: Thank you for clarifying. That is how I understood your situation. Somehow, you have to make sure that each document's random number is re-computed for every single search. – Adam Paynter Nov 07 '12 at 11:56
  • @Adam Paynter: Yes, that is ensured, as I create a new SearchReader object for every query, so internally a new `RandomOrderScoreDocComparator` will be used as well. But as said, even with this setup, one query contains some documents twice... – basZero Nov 08 '12 at 12:10
  • This is NOT, NOT a valid solution for randomized sorting in Lucene. Comparison-based sort algorithms require that the comparison function always returns the same result for the same pair within a given sort. With the solution here the results may be "undefined", but they are NOT statistically random. If you google "microsoft browser choice shuffle" you'll see how MSFT had the same bug in their browser choice screen. The correct solution is to return a statistically randomized hash of the doc ID in the comparator, as in the sketch after these comments. – adevine Nov 07 '13 at 22:52
  • Any more ideas to this? I updated the question to clarify it. http://stackoverflow.com/q/7201638/356815 – basZero Feb 03 '14 at 14:45
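
Following the two comments above, here is a minimal sketch (untested; class and constant names are hypothetical) of a Lucene 3.x comparator that derives a stable pseudo-random key from the doc ID and a per-search seed. Within one sort the keys never change, so the ordering can never contradict itself; a new seed per search gives a new shuffle:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;

public class SeededRandomFieldComparator extends FieldComparator<Integer> {

    private final int seed;     // fixed for one search, varied between searches
    private final int[] keys;   // remembered key per priority-queue slot
    private int docBase;
    private int bottom;

    public SeededRandomFieldComparator(int seed, int numHits) {
        this.seed = seed;
        this.keys = new int[numHits];
    }

    // Stable pseudo-random key: the same doc and seed always hash alike.
    private int key(int doc) {
        int h = (docBase + doc) ^ seed;
        h *= 0x9E3779B1;        // integer mixing step
        return h ^ (h >>> 16);
    }

    @Override
    public int compare(int slot1, int slot2) {
        // Compare remembered keys, never fresh random numbers
        int a = keys[slot1], b = keys[slot2];
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    @Override
    public int compareBottom(int doc) throws IOException {
        int b = key(doc);
        return bottom < b ? -1 : (bottom > b ? 1 : 0);
    }

    @Override
    public void copy(int slot, int doc) throws IOException {
        keys[slot] = key(doc);
    }

    @Override
    public void setBottom(int slot) {
        bottom = keys[slot];
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
    }

    @Override
    public Integer value(int slot) {
        return keys[slot];
    }

}

It wires into a Sort exactly like the comparators above: return it from a FieldComparatorSource, create a fresh seed for each new search, and reuse that seed while the user pages through results.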

Here's my solution, which so far has proven to avoid duplicate results. It's in C# (for Lucene.Net), but I started from the Java sample, so it should be easy to convert back.

The trick I used is a unique ID per search that stays unchanged while the user clicks through pagination. I already had this unique ID, which I used for reporting purposes. I initialize it when the user clicks the search button and then pass it, encrypted, in the query string. It's finally passed as the seed parameter to RandomOrderFieldComparatorSource (actually it's ID.GetHashCode()).

This means I use the same series of random numbers for the same search, so each docId gets the same sorting index even when the user navigates to other result pages.

One more note: slots could probably be a vector whose size equals the page size.

public class RandomOrderFieldComparator : FieldComparator
{
    Random random;
    List<int> slots = new List<int>();
    Dictionary<int, int> docSortIndices = new Dictionary<int, int>();

    public RandomOrderFieldComparator(int? seed)
    {
        random = seed == null ? new Random() : new Random(seed.Value);
    }

    private int bottom; // Value of bottom of queue

    public override int Compare(int slot1, int slot2)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return values[slot1] - values[slot2] because that
        // may overflow
        int v1 = slots[slot1];
        int v2 = slots[slot2];
        if (v1 > v2) {
            return 1;
        } else if (v1 < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override int CompareBottom(int doc)
    {
        // TODO: there are sneaky non-branch ways to compute
        // -1/+1/0 sign
        // Cannot return bottom - values[slot2] because that
        // may overflow
        int v2 = GetDocSortingIndex(doc);
        if (bottom > v2) {
            return 1;
        } else if (bottom < v2) {
            return -1;
        } else {
            return 0;
        }
    }

    public override void Copy(int slot, int doc)
    {
        if (slots.Count > slot) {
            slots[slot] = GetDocSortingIndex(doc);
        } else {
            slots.Add(GetDocSortingIndex(doc));
        }
    }

    public override void SetBottom(int bottom)
    {
        this.bottom = slots[bottom];
    }

    public override System.IComparable Value(int slot)
    {
        return (System.Int32)slots[slot];
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
    }

    int GetDocSortingIndex(int docId)
    {
        if (!docSortIndices.ContainsKey(docId))
            docSortIndices[docId] = random.Next(int.MinValue, int.MaxValue);

        return docSortIndices[docId];
    }
}
Bogdan Litescu
  • Thanks, but I meant duplicate results after ONE SINGLE query. Pagination is not part of my question. But it's a nice approach for your pagination solution (probably important when the index changes quite often). – basZero Aug 30 '12 at 13:15
  • Some customers of mine also reported duplicate results, so I ended up reimplementing the RandomOrderScoreDocComparator starting from IntComparator in Lucene.Core. I've updated my post with the new code. – Bogdan Litescu Jan 25 '13 at 08:11

Here is an updated version for Lucene.Net 4.8.

class RandomFieldComparatorSource : FieldComparerSource
{
    public override FieldComparer NewComparer(string fieldname, int numHits, int sortPos, bool reversed)
    {
        return new RandomOrderFieldComparator(fieldname, null);
    }
}

class RandomOrderFieldComparator : NumericComparer<int>
{
    static Random random = new Random();

    public RandomOrderFieldComparator(string field, int? missingValue) : base(field, missingValue)
    {
    }

    public override int Compare(int slot1, int slot2)
    {
        // Fresh random result (-1, 0, or 1) on every comparison
        return random.Next(-1, 2);
    }

    public override int CompareBottom(int doc)
    {
        return random.Next(-1, 2);
    }

    public override void Copy(int slot, int doc)
    {
    }

    public override void SetBottom(int slot)
    {
    }

    public override void SetTopValue(object value)
    {
    }

    public override int CompareTop(int doc)
    {
        return random.Next(-1, 2);
    }

    public override IComparable this[int slot] => random.Next();
}
farlee2121
  • Warning: this currently has a bias for low document ids and I'm not exactly sure why. It happens even when I take random values from int.MinValue to int.MaxValue, so it's not entirely an equality issue. – farlee2121 Jan 29 '20 at 20:07