I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene 6.0.0 to parse queries with a CustomAnalyzer, as shown below:
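import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public static void testFilmAnalyzer() throws IOException, ParseException {
    // Char filter: drop "movie", "film" or "picture" and everything after it.
    CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
            .addCharFilter("patternreplace",
                    "pattern", "(movie|film|picture).*",
                    "replacement", "")
            .withTokenizer("standard")
            .build();
    QueryParser qp = new QueryParser("name", nameAnalyzer);
    qp.setDefaultOperator(QueryParser.Operator.AND);
    String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
    for (String str : strs) {
        // Show what the analyzer alone produces, then what the parser builds.
        System.out.println("Analyzing \"" + str + "\":");
        showTokens(str, nameAnalyzer);
        Query q = qp.parse(str);
        System.out.println("Parsed query of \"" + str + "\":");
        System.out.println(q + "\n");
    }
}

private static void showTokens(String text, Analyzer analyzer) throws IOException {
    StringReader reader = new StringReader(text);
    TokenStream stream = analyzer.tokenStream("name", reader);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.print("[" + term.toString() + "]");
    }
    stream.end(); // complete the TokenStream workflow before closing
    stream.close();
    System.out.println();
}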
I get the following output when I invoke testFilmAnalyzer:
Analyzing "avatar film fiction":
[avatar]
Parsed query of "avatar film fiction":
+name:avatar +name:fiction
Analyzing "avatar-film fiction":
[avatar]
Parsed query of "avatar-film fiction":
+name:avatar +name:fiction
Analyzing "avatar-film-fiction":
[avatar]
Parsed query of "avatar-film-fiction":
name:avatar
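It seems that the analyzer applies the PatternReplaceCharFilter in the intended order (i.e. before tokenization), whereas the QueryParser appears to apply it only after the query string has already been split into terms: "fiction" survives in the parsed query whenever it is separated from "film" by whitespace, but not when it is joined to it by a hyphen. Does anyone have an explanation for that? Isn't that a bug?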
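To make the suspected difference concrete, here is a minimal sketch that reproduces the outputs above without Lucene. It rests on an assumption rather than on anything confirmed from the Lucene source: that the parser splits the query string on whitespace first and then runs the analyzer on each chunk separately. The class `SplitOrderSketch` and the methods `analyze` and `parseLike` are hypothetical stand-ins, and the crude regex split is only an approximation of the standard tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOrderSketch {
    // Same pattern as the char filter in the question.
    static final String PATTERN = "(movie|film|picture).*";

    // Stand-in for the analyzer: apply the pattern replacement to the whole
    // input first, then split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        String filtered = text.replaceAll(PATTERN, "");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumed parser behaviour: split on whitespace first, then analyze
    // each chunk on its own, so the pattern never sees the whole string.
    static List<String> parseLike(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            tokens.addAll(analyze(chunk));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String s : strs) {
            System.out.println("\"" + s + "\": analyzer " + analyze(s)
                    + ", parser-like " + parseLike(s));
        }
    }
}
```

Under that assumption, the first two inputs give [avatar] for the analyzer and [avatar, fiction] for the parser-like path, while the third gives [avatar] for both, which matches what I see from Lucene.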