I want to use Lucene (version 4.10) to process several million news articles. I'm quite new to Lucene, so I'm trying to learn more about how it works. In every Lucene document I store one news article. Every article of course has its content (the field is called "TextContent").
I create the field like this (related to this Stack Overflow question):
/* Indexed, tokenized, stored. */
public static final FieldType TYPE_STORED = new FieldType();
static {
    TYPE_STORED.setIndexed(true);
    TYPE_STORED.setTokenized(true);
    TYPE_STORED.setStored(true);
    TYPE_STORED.setStoreTermVectors(true);
    TYPE_STORED.setStoreTermVectorPositions(true);
    TYPE_STORED.freeze();
}
doc.add(new Field("TextContent", oneArticle.getTextContent(), TYPE_STORED));
I do it like this because I also want the term vectors of the text content to be stored (for building phrase queries, so that I can, for example, easily retrieve the term vector of one news article and use its terms to search for other related articles).
I now want to search for one or several words (combined with the boolean clauses Occur.SHOULD or Occur.MUST).
My code looks like this (words is simply a List containing all terms to search for):
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(PATH_TO_INDEX)));
IndexSearcher searcher = new IndexSearcher(reader);
BooleanQuery booleanQuery = new BooleanQuery();
// words is simply a List<String> containing all terms to search for
for (String word : words) {
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("TextContent", word));
    booleanQuery.add(query, BooleanClause.Occur.SHOULD);
}
// collects the results, scored by the Similarity function
TopScoreDocCollector collector = TopScoreDocCollector.create(reader.numDocs(), true);
searcher.search(booleanQuery, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println(hits.length);
// print at most the top 10 hits (there may be fewer than 10)
for (int i = 0; i < Math.min(10, hits.length); i++) {
    int id = hits[i].doc;
    Document d = searcher.doc(id);
    System.out.println(d.get("TextContent"));
}
I AM getting results from time to time, but not enough, and only for very popular search terms (for example, "soccer" as a search term returns 15000 articles, while the index holds several million news articles).
When I search for less popular terms that do occur in my TextContent field, I get 0 results. For example, I have a document whose text content starts:
"Sonny Bill Williams will reunite with former All Blacks captain Tana Umaga [..]. The 29-year-old dual rugby international [...]"
If I now add only the word "rugby" to my List words, I get 4125 results, and the article I just quoted is in the top 10. If I instead add only the word "Williams" (the name of this rugby player, see the quote above), I get 0 results.
I don't understand this behaviour. I was speculating that it has to do with how I create the "TextContent" field in my Lucene index. Ongoing Google research has led me to several other Stack Overflow questions (e.g. here and here). The difference from my question is that I AM getting results from time to time, but only for very popular terms.
Can you please tell me what I am doing wrong? Can you tell me how I should maybe alter my TextContent Field / FieldType to deliver better results? Or how should I maybe change my queries?
Thanks a lot for every answer and thought you're sharing with me.
UPDATE: NEW KNOWLEDGE ARRIVED
From this Stack Overflow question I got the idea to try "williams" (all lowercase) instead of "Williams". The relevant quote from one of the answers was:
The reason why you don't get your documents back is that while indexing you're using StandardAnalyzer, which converts tokens to lowercase and removes stop words.
This worked. I am getting results if I write every search term in lowercase. I also checked my index with Luke and found that all terms in my term vectors have been converted to lowercase. I will leave this update here for now to give room for further answers (maybe something is still wrong or could be improved for better results). If no answers come in, I will later post this as my own answer.
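To make the effect easier to see, here is a minimal plain-Java sketch of what I believe is going on. It does not use Lucene at all: analyze, buildIndex, rawLookup and analyzedLookup are hypothetical helpers I made up for illustration, and the toy analyze method only imitates the lowercasing that the quoted answer attributes to StandardAnalyzer (no stop-word removal). The assumption is that a query built directly from new Term("TextContent", word) is compared against the index as-is, like rawLookup here, while the indexed text went through the analyzer.

```java
import java.util.*;

class LowercaseIndexDemo {
    // Toy stand-in for an analyzer (assumption): split on anything that is
    // not a letter or digit, then lowercase every token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Builds a tiny inverted index: analyzed term -> ids of documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyze(docs.get(id))) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Exact lookup with no analysis, analogous to querying with a raw term.
    static Set<Integer> rawLookup(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // Lookup that first runs the search word through the same analysis as the documents.
    static Set<Integer> analyzedLookup(Map<String, Set<Integer>> index, String word) {
        Set<Integer> result = new TreeSet<>();
        for (String token : analyze(word)) {
            result.addAll(rawLookup(index, token));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Sonny Bill Williams will reunite with former All Blacks captain Tana Umaga",
            "The 29-year-old dual rugby international");
        Map<String, Set<Integer>> index = buildIndex(docs);

        System.out.println(rawLookup(index, "Williams"));      // [] (index only holds "williams")
        System.out.println(rawLookup(index, "williams"));      // [0]
        System.out.println(analyzedLookup(index, "Williams")); // [0]
    }
}
```

If this picture is right, the fix on the query side is to normalize each search word the same way the indexed text was normalized before wrapping it in a Term, for example by lowercasing it or by running it through the same analyzer that was used for indexing.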