I have three product names:
- Bounty Select-A-Size White Paper Towels 12 Mega Rolls
- Bounty Select-A-Size Paper Towels (12 rolls)
- Bounty Select-A-Size Paper Towels White 12 Mega Rolls
As you can see, the first and third names are identical except for the position of the word "White". The second name lacks the words "White" and "Mega".
Now, when I run the following code:
public static void main(String[] args) throws IOException, ParseException {
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. create the index
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "Bounty Select-A-Size White Paper Towels 12 Mega Rolls");
    addDoc(w, "Bounty Select-A-Size Paper Towels (12 rolls)");
    addDoc(w, "Bounty Select-A-Size Paper Towels White 12 Mega Rolls");
    w.close();

    // 2. query
    String querystr = "Bounty Select-A-Size White Paper Towels 12 Mega Rolls";
    Query q = new QueryParser("title", analyzer).parse(querystr);

    // 3. search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    ScoreDoc[] hits = searcher.search(q, 4).scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title") + "\t score " + hits[i].score);
    }
    reader.close();
}

private static void addDoc(IndexWriter w, String title) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    w.addDocument(doc);
}
The result is:
1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls score 0.7363191
2. Bounty Select-A-Size Paper Towels White 12 Mega Rolls score 0.7363191
3. Bounty Select-A-Size Paper Towels (12 rolls) score 0.42395753
So far, so good: the first two results contain exactly the same words, so they score the same.
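To make "same words" concrete, here is a minimal sketch in plain Java (no Lucene) that compares the names as bags of lowercased words. The `TermBag` class and its splitting rule are assumptions for illustration only; they roughly approximate what `StandardAnalyzer` produces and are not its actual tokenization.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TermBag {
    // Hypothetical stand-in for an analyzer: lowercase the title, then split
    // on every run of characters that is not a letter or a digit.
    static Map<String, Integer> termCounts(String title) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String first = "Bounty Select-A-Size White Paper Towels 12 Mega Rolls";
        String second = "Bounty Select-A-Size Paper Towels (12 rolls)";
        String third = "Bounty Select-A-Size Paper Towels White 12 Mega Rolls";

        // Word order is ignored, so the first and third names compare equal.
        System.out.println(termCounts(first).equals(termCounts(third)));  // true
        // The second name has no "white" or "mega", so it differs.
        System.out.println(termCounts(first).equals(termCounts(second))); // false
    }
}
```

Under this simplified view, any scoring that only looks at which words occur (and how often) cannot tell the first and third names apart.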
However, when I index many more names (same code, but instead of hard-coding 3 names, I read about 5000 of them from a file), the scores change:
1. Bounty Select-A-Size White Paper Towels 12 Mega Rolls 4.1677103
2. Bounty Select-A-Size Paper Towels (12 rolls) 4.1677103
3. Bounty Select-A-Size Paper Towels White 12 Mega Rolls 2.874553
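The file-loading part is not shown above. A minimal sketch of it, assuming one product name per line, could look like the following; the `TitleLoader` class, the `loadTitles` helper, and the temporary file are hypothetical and only illustrate the idea.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TitleLoader {
    // Reads one title per line (UTF-8), trimming surrounding whitespace
    // and skipping blank lines.
    static List<String> loadTitles(Path file) throws IOException {
        List<String> titles = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String title = line.trim();
            if (!title.isEmpty()) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file, created here only so the sketch runs on its own.
        Path file = Files.createTempFile("titles", ".txt");
        Files.write(file, Arrays.asList(
                "Bounty Select-A-Size White Paper Towels 12 Mega Rolls",
                "",
                "  Bounty Select-A-Size Paper Towels (12 rolls)  "));

        for (String title : loadTitles(file)) {
            System.out.println("[" + title + "]");
        }
        Files.delete(file);
    }
}
```

Each returned title would then be passed to `addDoc(w, title)` in place of the three hard-coded calls.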
My question is:
Is it possible for the scores to change this way when the data set changes?
If so, how?
If not, then I know there is a bug in my code...
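For context on why corpus size might matter at all: in TF-IDF and BM25 style ranking, each term's weight depends on how many documents in the whole index contain it. The sketch below uses a textbook BM25-style inverse document frequency formula. It is an assumption that this matches the similarity in use here, and the document frequency for the 5000-name index is a made-up value, not measured from my data.

```java
class IdfSketch {
    // Textbook BM25-style inverse document frequency:
    //   idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    // docCount = number of documents in the index,
    // docFreq  = number of documents containing the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // "white" occurs in 2 of the 3 hard-coded names.
        System.out.println("3 docs:    " + idf(3, 2));
        // Hypothetical: "white" still occurs in only 2 of 5000 names.
        System.out.println("5000 docs: " + idf(5000, 2));
    }
}
```

With the same document frequency, the weight grows as the index grows, so absolute score values are not comparable across the two runs. This sketch alone does not explain a change in the relative order of two names that contain the same words.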