First, know that the boost is not a constant weight assigned to each query; rather, it's a multiplier applied to the query's score. So when you set the boost to 1 on query #4 and to 3 on query #3, it's theoretically possible that query #4 ends up with a higher "boosted score" if its base score is more than three times that of query #3. To avoid that kind of problem, you can mark the score of each query as constant (use .boostedTo(3l).withConstantScore().onField("tags") instead of .onField("tags").boostedTo(3l)).
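To make the multiplier point concrete, here is a minimal arithmetic sketch. It is plain Java and does not call Hibernate Search or Lucene; the base scores are made-up numbers, and the class and method names are only for illustration. It also assumes that a constant-score clause ends up with a score proportional to its boost.

```java
// Plain-Java arithmetic only; no Hibernate Search or Lucene calls are made here.
final class BoostAsMultiplier {

    // A boost multiplies the relevance score computed for the clause.
    static float boosted(float baseScore, float boost) {
        return baseScore * boost;
    }

    // Assumption: with a constant score, the base score is ignored
    // and only the boost is left to rank the clause.
    static float constantScore(float boost) {
        return boost;
    }

    public static void main(String[] args) {
        float baseScoreQuery3 = 1.0f; // hypothetical base score for query #3
        float baseScoreQuery4 = 4.0f; // hypothetical base score for query #4

        // 4.0 * 1 is greater than 1.0 * 3, so the clause with the lower boost wins.
        System.out.println(boosted(baseScoreQuery4, 1f) > boosted(baseScoreQuery3, 3f));

        // With constant scores, the clause boosted to 3 always beats the one boosted to 1.
        System.out.println(constantScore(3f) > constantScore(1f));
    }
}
```

Both comparisons come out true: the first shows the problem, the second shows why constant scores make the boosts behave like fixed weights.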
Second, the phrase query is not what you think it is. The phrase query accepts a multi-term input string, and will look for documents that contain these terms in the same order. Since you passed a single term, it's pointless. So you need something else.
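As a rough illustration of what a phrase match means, here is a plain-Java sketch that mimics it on already-tokenized text. It is not Lucene code: the class and method names are made up, and it assumes the terms must be adjacent (no slop).

```java
import java.util.Arrays;
import java.util.List;

// Simplified imitation of phrase matching on a list of tokens; not a Lucene API.
final class PhraseMatchSketch {

    // True if phraseTerms appear in docTokens consecutively and in the same order.
    static boolean containsPhrase(List<String> docTokens, List<String> phraseTerms) {
        if (phraseTerms.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phraseTerms.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phraseTerms.size()).equals(phraseTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("chocolate", "milk", "shake");
        System.out.println(containsPhrase(doc, Arrays.asList("milk", "shake"))); // order respected
        System.out.println(containsPhrase(doc, Arrays.asList("shake", "milk"))); // wrong order
        System.out.println(containsPhrase(doc, Arrays.asList("milk")));          // single term
    }
}
```

With a single term, the check degenerates to "does the document contain this term", which is what a plain keyword query already does.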
Query 1: Results having the word at the beginning
I believe the only way to do exactly what you want is to use span queries. However, they are not part of the Hibernate Search DSL, so you'll have to rely on low-level Lucene APIs. What's more, I've never used them, and I'm not sure how they are supposed to be used... What little I know was taken from Elasticsearch's documentation, as the Lucene documentation is severely lacking.
You can try something like this, but if it doesn't work you'll have to debug it yourself (I don't know more than you do):
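    // Imports for this snippet (they go at the top of the file)
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.hibernate.search.query.dsl.QueryBuilder;

    QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
            .buildQueryBuilder()
            .forEntity(ProductEntity.class)
            .get();
    Analyzer analyzer = fullTextEntityManager.getSearchFactory()
            .getAnalyzer(ProductEntity.class);
    Query myQuery = queryBuilder
            .bool()
            // Constant score first, then a fixed boost, so only the boost ranks this clause
            .should(new BoostQuery(new ConstantScoreQuery(createSpanQuery(queryBuilder, "description", query, analyzer)), 9f))
            [... add other clauses here...]
            .createQuery();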
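    // Other methods (to be added to the same class).
    // Needs BooleanJunction (org.hibernate.search.query.dsl), Term (org.apache.lucene.index)
    // and SpanPositionRangeQuery/SpanTermQuery (org.apache.lucene.search.spans).
    private static Query createSpanQuery(QueryBuilder qb, String fieldName, String searchTerms, Analyzer analyzer) {
        BooleanJunction bool = qb.bool();
        List<String> terms = analyze(fieldName, searchTerms, analyzer);
        for (int i = 0; i < terms.size(); ++i) {
            // The i-th search term must sit at position i; the end position is exclusive, hence i + 1
            bool.must(new SpanPositionRangeQuery(new SpanTermQuery(new Term(fieldName, terms.get(i))), i, i + 1));
        }
        return bool.createQuery();
    }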
    // Runs the analyzer on the search string and returns the resulting terms, in order.
    // Needs java.io.IOException/Reader/StringReader, java.util.ArrayList/List,
    // org.apache.lucene.analysis.TokenStream and
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute.
    private static List<String> analyze(String fieldName, String searchTerms, Analyzer analyzer) {
        List<String> terms = new ArrayList<String>();
        try {
            final Reader reader = new StringReader( searchTerms );
            final TokenStream stream = analyzer.tokenStream( fieldName, reader );
            try {
                CharTermAttribute attribute = stream.addAttribute( CharTermAttribute.class );
                stream.reset();
                while ( stream.incrementToken() ) {
                    if ( attribute.length() > 0 ) {
                        String term = new String( attribute.buffer(), 0, attribute.length() );
                        terms.add( term );
                    }
                }
                stream.end();
            }
            finally {
                stream.close();
            }
        }
        catch (IOException e) {
            throw new IllegalStateException( "Unexpected exception while analyzing search terms", e );
        }
        return terms;
    }
Query 2: Results having the word in second or third position
I believe you can use the same code as for query 1, but adding an offset. If the actual position doesn't matter, and you'll accept words in fourth or fifth position, you can simply do this:
    queryBuilder.keyword().boostedTo(5l).withConstantScore()
            .onField("description").matching(query)
            .createQuery()
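If you do want the strict "second or third position" variant, the idea is a position range check. As far as I can tell, Lucene's SpanPositionRangeQuery(match, start, end) counts positions from 0 and treats end as exclusive, so "second or third" would be start = 1 and end = 3. The sketch below only mimics that check in plain Java on a list of tokens; it is not Lucene code, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java imitation of a position range check; not a Lucene API.
final class PositionRangeSketch {

    // True if term occurs at a token position p with start <= p < end (end is exclusive).
    static boolean matchesInRange(List<String> tokens, String term, int start, int end) {
        int from = Math.max(start, 0);
        int to = Math.min(end, tokens.size());
        for (int p = from; p < to; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("fresh", "milk", "from", "the", "farm");
        System.out.println(matchesInRange(tokens, "milk", 0, 1)); // first position only
        System.out.println(matchesInRange(tokens, "milk", 1, 3)); // second or third position
    }
}
```

Here "milk" is the second token, so the first check fails and the second one succeeds.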
Query 3: Results having the word in a phrase (MilkShake)
From what I understand, you mean "results containing a word that contains the search term".
You could use wildcard queries for that, but unfortunately these queries do not apply analyzers, resulting in case-sensitive search (among other problems).
Your best bet is probably to define a separate field for this query, e.g. description_ngram, and assign a specially-crafted analyzer to it, one which uses the ngram tokenizer when indexing. The ngram tokenizer simply takes an input string and transforms it into all its substrings: "milkshake" would become ["m", "mi", "mil", "milk", ..., "milkshake", "i", "il", "ilk", "ilks", "ilksh", ... "ilkshake", "l", ... "lkshake", ..., "ke", "e"]. Obviously it takes a lot of disk space, but it can work for small-ish datasets.
You will find instructions for a similar use case here. The answer mentions a different analyzer, "edgengram", but in your case you'll really want to use the "ngram" analyzer.
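To give an idea of why the ngram approach is so disk-hungry, here is a plain-Java sketch that produces every substring of a word, which is what the description above amounts to. It is not the Lucene tokenizer (the real one is configured with minimum and maximum gram sizes); the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of "all substrings" n-gram generation; not the Lucene tokenizer.
final class NGramSketch {

    // Returns every non-empty substring of the input, grouped by start offset.
    static List<String> allNGrams(String input) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                grams.add(input.substring(start, end));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = allNGrams("milkshake");
        System.out.println(grams.size());            // n * (n + 1) / 2 substrings for n characters
        System.out.println(grams.contains("milk"));  // a search for "milk" would hit this token
        System.out.println(grams.contains("shake")); // and so would a search for "shake"
    }
}
```

A 9-letter word already yields 45 tokens, which is why capping the gram sizes matters on anything but small datasets.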
Alternatively, if you're sure the indexed text is correctly formatted to clearly separate components of a "composite" word (e.g. "milk-shake", "MilkShake", ...), you can simply create a field (e.g. description_worddelimiterfilter) that uses an analyzer with a word-delimiter filter (see org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter) which will split these composite words. Then you can simply query like this:
    queryBuilder.keyword().boostedTo(3l).withConstantScore()
            .onField("description_worddelimiterfilter")
            .matching(query)
            .createQuery()
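For reference, here is roughly what I'd expect the word-delimiter splitting to do to composite words, written as a plain-Java sketch. It is a simplified imitation, not the actual WordDelimiterFilter: it only splits on non-alphanumeric characters and on lower-to-upper case changes, and it lowercases the parts as a lowercase filter would. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified imitation of word-delimiter splitting; not the Lucene filter itself.
final class WordDelimiterSketch {

    // Splits on non-alphanumeric characters and on lower-to-upper case transitions.
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean delimiter = !Character.isLetterOrDigit(c);
            boolean caseChange = i > 0
                    && Character.isLowerCase(word.charAt(i - 1))
                    && Character.isUpperCase(c);
            if ((delimiter || caseChange) && current.length() > 0) {
                parts.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
            if (!delimiter) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("MilkShake"));  // split on the case change
        System.out.println(split("milk-shake")); // split on the hyphen
        System.out.println(split("milkshake"));  // nothing to split on
    }
}
```

The last call shows the limitation: a plain "milkshake" stays in one piece, so this approach only helps when the indexed text really marks the word boundaries.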