Lucene Scoring mechanism

Question

I have 3 product names:

Bounty Select-A-Size White Paper Towels 12 Mega Rolls
Bounty Select-A-Size Paper Towels (12 rolls)
Bounty Select-A-Size Paper Towels White 12 Mega Rolls

As you can see, the 1st and 3rd names contain the same words; only the position of the word "White" differs. The 2nd name lacks the words "White" and "Mega".

Now, when I run the following code:

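// Imports the snippet needs (package names assume the Lucene 5.x line current at the time of posting)
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public static void main(String[] args) throws IOException, ParseException {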
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. create the index
    Directory index = new RAMDirectory();

    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "Bounty Select-A-Size White Paper Towels 12 Mega Rolls");
    addDoc(w, "Bounty Select-A-Size Paper Towels (12 rolls)");
    addDoc(w, "Bounty Select-A-Size Paper Towels White 12 Mega Rolls");
    w.close();

    // 2. query
    String querystr = "Bounty Select-A-Size White Paper Towels 12 Mega Rolls";

    Query q = new QueryParser("title", analyzer).parse(querystr);

    // 3. search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    ScoreDoc[] hits = searcher.search(q, 4).scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for(int i=0;i<hits.length;++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title") + "\t score " + hits[i].score);
    }

    reader.close();
}

private static void addDoc(IndexWriter w, String title) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    w.addDocument(doc);
}

The result is:

 1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls    score 0.7363191
 2. Bounty Select-A-Size Paper Towels White 12 Mega Rolls    score 0.7363191
 3. Bounty Select-A-Size Paper Towels (12 rolls)     score 0.42395753

So far, so good: the first 2 names are made up of the same words, so they score the same.

However, when I extend the set of indexed names (same code, but instead of the 3 hard-coded names, I load about 5000 of them from a file), the scoring changes:

 1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls             4.1677103
 2. Bounty Select-A-Size Paper Towels (12 rolls)                     4.1677103
 3. Bounty Select-A-Size Paper Towels White 12 Mega Rolls            2.874553

My question is:

Is it possible for the scores to change this way when the data set changes?

If so, how?

If not, then I know there is a bug in my code...

As a general rule, Lucene scores across different queries (or the same query over a different data set) are not comparable. If you accept this fact, you and Lucene will be good friends. What's important is that in both cases the two "equivalent" entries got joint first place and the less correct one came third (with roughly 60-70% of the winning score). — biziclop, Feb 23 '16 at 23:14
A comment on my answer brought up that I misread the order of the results in your second result set. My guess was that it was entered incorrectly when typing the question and that results 2 and 3 should be swapped. Is my assumption correct? — femtoRgon, Feb 24 '16 at 09:20
@femtoRgon Thanks to your discussion with Codo, I found the bug in my code. You are right, results 2 and 3 should be swapped. That's not a typo in my question; the bug in my code caused it. What I learned is that the situation in my question should NEVER occur: if 2 strings are permutations of each other (same elements, different positions), they should always have the SAME score (the tf-idf score is the sum of each element's score; same elements, same score). But that score can change when a different data set is used. Thank you very much! — user2628641, Feb 24 '16 at 15:34
@user2628641 `if 2 strings are permutations of each other (same elements, different positions), they should always have the SAME score` Unless you're using proximity searches, see [this question](http://stackoverflow.com/questions/25558195/lucene-proximity-search-for-phrase-with-more-than-two-words) for example. — biziclop, Feb 29 '16 at 13:27
@biziclop Question: if I don't add a proximity constraint and just query "white paper towel", then, from my understanding, Lucene will look for all documents containing "white" or "paper" or "towel" and give each a tf-idf score. So I don't think my statement is invalidated unless proximity search is used. In my example, after I fixed the bug, names 1 and 3 score the same, even though "White" sits in a different position in each. Please correct me if my understanding is wrong. Thanks! — user2628641, Feb 29 '16 at 14:46
@user2628641 No, you are correct. I just added that proximity search is an exception as it CAN return different results for documents that contain the same words in a different order. — biziclop, Feb 29 '16 at 14:59
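
The point settled in the comments above, that a plain term query ignores word order, can be sketched in plain Java. This is not Lucene code: the class OrderFreeScore, the crude tokenizer, and the per-term weights are all assumptions made up for illustration, and real Lucene scoring applies further factors such as length normalization.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class OrderFreeScore {
    // Hypothetical per-term weights standing in for each term's tf-idf contribution.
    static Map<String, Double> demoWeights() {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("bounty", 1.5);
        w.put("white", 1.2);
        w.put("paper", 0.4);
        w.put("towels", 0.6);
        w.put("mega", 1.1);
        w.put("rolls", 0.5);
        return w;
    }

    // Crude stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
    }

    // Sum the weight of every query term the document contains.
    // Term positions are never consulted, so word order cannot affect the result.
    static double score(String doc, Map<String, Double> weights) {
        Set<String> docTokens = tokens(doc);
        double sum = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (docTokens.contains(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> w = demoWeights();
        System.out.println(score("Bounty Select-A-Size White Paper Towels 12 Mega Rolls", w));
        System.out.println(score("Bounty Select-A-Size Paper Towels White 12 Mega Rolls", w));
        System.out.println(score("Bounty Select-A-Size Paper Towels (12 rolls)", w));
    }
}
```

The first two names produce identical sums because they contain the same terms, while the third is lower because it lacks "white" and "mega". A proximity or phrase query would need position information, which this sketch deliberately leaves out.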

femtoRgon · Accepted Answer · 2016-02-23T23:08:41.997

That's entirely normal, and not at all indicative of a bug in your code.

Scores can change when the contents of your index change, even if those changes don't seem to have much to do with your particular query. Scores are really only meaningful within the context of a particular search execution, so what matters is not their absolute value but that they make sense relative to the other results of the same query. In both result sets, the first two have equal score, and the other is significantly lower.

The main reason for the change here will be the idf (inverse document frequency) scoring factor. That is intended to weigh more heavily terms that occur less frequently across the entire index, the thinking being that a common term like "the" is less interesting as a search result than a less common one like "geronimo".
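
To make the idf effect concrete, here is a minimal sketch in plain Java. It assumes the textbook form idf = 1 + ln(numDocs / (docFreq + 1)), which resembles what Lucene's classic TF-IDF similarity uses; the exact formula depends on the Lucene version and the configured Similarity. The class IdfSketch, its idf helper, and the document frequencies chosen for the 5000-document case are made up for illustration.

```java
class IdfSketch {
    // Assumed textbook-style idf: rarer terms (small docFreq) get a larger value.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Three-document index from the question: "white" is in 2 of 3 docs, "paper" in all 3.
        System.out.println("white, 3 docs:    " + idf(2, 3));
        System.out.println("paper, 3 docs:    " + idf(3, 3));

        // Hypothetical 5000-document index with invented document frequencies.
        System.out.println("white, 5000 docs: " + idf(40, 5000));
        System.out.println("paper, 5000 docs: " + idf(900, 5000));
    }
}
```

The same term gets a different idf as soon as numDocs or docFreq changes, which is why adding unrelated documents to the index shifts the scores of an unchanged query.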

In your case, the ratio between your best result and the third result has narrowed a bit, with the rest of the corpus available, so it would seem that "white" and "mega" are more common (and thus, less interesting) terms than some of the other ones in the query.
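
The narrowing can be checked directly from the scores quoted in the question. This small sketch (the class name ScoreRatio is hypothetical) divides the lowest score by the top score in each result set:

```java
class ScoreRatio {
    // Fraction of the top score reached by the lowest-ranked result.
    static double ratio(double lowest, double top) {
        return lowest / top;
    }

    public static void main(String[] args) {
        // Scores copied from the two result listings in the question.
        System.out.printf("3 docs:    %.3f%n", ratio(0.42395753, 0.7363191));
        System.out.printf("5000 docs: %.3f%n", ratio(2.874553, 4.1677103));
    }
}
```

The first ratio comes out near 0.58 and the second near 0.69, so the lowest-ranked result sits closer to the top score in the larger index.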

An additional note: You can use Lucene's IndexSearcher.explain method to get detailed information on why documents score the way they do:

System.out.println(searcher.explain(query, docNumber).toString());

edited Feb 23 '16 at 23:08

answered Feb 23 '16 at 23:03


Your answer does not explain why two documents with the same words (in different order) have a different score. That's quite surprising to me and looks like a bug. – Codo Feb 24 '16 at 07:00
What kind of data is in your 5000 documents? Are the other docs similar? Here it looks like "White" is a rarer term than "Paper" in the corpus (is "Paper" part of almost all documents?), and that's why the IDF score of "Paper" is making your 3rd document score lower. But do check "searcher.explain(query, docNumber)" for each matching document to confirm. – Rushik Feb 24 '16 at 08:31
@Codo - That would seem odd, yes, but is simply not the case being presented by this question. If you have run across this sort of behavior and find it difficult to explain, please do ask your own question. – femtoRgon Feb 24 '16 at 09:06
@femtoRgon Are you sure? The question isn't very specific. But as he/she explicitly mentioned the case with the same words in a different order, this might be what he/she's looking for an explanation of. – Codo Feb 24 '16 at 09:10
@Codo - Just looked more closely, and you're right, the second result set *does* look that way. I strongly suspect, however, that was a mistake made when typing the question. I've commented on the question, asking for clarification. – femtoRgon Feb 24 '16 at 09:20
