The following notes are based on Lucene 8.x, not on Lucene 6.6 - so there may be some syntax differences - but I take your point about how any such differences should be coincidental to your question.
Here are some notes, where I will focus on the following aspect of your question:
However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer
I think there are 2 parts to this:
Firstly, the query example using "a sample title"
will - as you say - not work well with how a standard analyzer works - for the reasons you state.
But, secondly, it is possible to combine the two types of query you want to use, in a way which I believe gets you what you need: An exact match for the type
field (e.g. RockMusic
) and a more traditional tokenized & case-insensitive result for the title
field (a sample title
).
Here is how I would do that:
Here is some simple test data:
public static void buildIndex() throws IOException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
Document doc;
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
doc = new Document();
doc.add(new StringField("type", "RockMusic", Field.Store.YES));
doc.add(new TextField("title", "a sample title", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new StringField("type", "RockMusic", Field.Store.YES));
doc.add(new TextField("title", "another different title", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new StringField("type", "Rock Music", Field.Store.YES));
doc.add(new TextField("title", "a sample title", Field.Store.YES));
writer.addDocument(doc);
}
}
Here is the query code:
public static void doSearch() throws QueryNodeException, ParseException, IOException {
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("title", analyzer);
Query titleQuery = parser.parse("A Sample Title");
Query query = new BooleanQuery.Builder()
.add(typeQuery, BooleanClause.Occur.MUST)
.add(titleQuery, BooleanClause.Occur.MUST)
.build();
System.out.println("Query: " + query.toString());
System.out.println();
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
System.out.println("doc = " + hit.doc + "; score = " + hit.score);
Document doc = searcher.doc(hit.doc);
System.out.println("Type = " + doc.get("type")
+ "; Title = " + doc.get("title"));
System.out.println();
}
}
The output from the above query is as follows:
Query: +type:RockMusic +(title:a title:sample title:title)
doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title
doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title
As you can see, this query is a little different from the one taken from your question.
But the list of found documents shows that (a) the Rock Music
document was not found at all (good - because Rock Music
does not match the "type" search term of RockMusic
); and (b) the title a sample title
got a far higher match score than the another different title
document, when searching for A Sample Title
.
Additional notes:
This query works by combining a StringField
exact search with a more traditional TextField
tokenized search - this latter search being processed by the StandardAnalyzer
(matching how the data was indexed in the first place).
I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.
This approach would also apply to your BRAIN
vs. brain
example, for StringField
data.
(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).
You could obviously enhance the analyzer to include stop-words, and so on, as needed.
Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
Hope that this makes sense - and that I understood the problem correctly.