I am using Lucene for indexing and searching. Below is the code I use for searching. With the current code the results come back sorted, but I want them ordered by relevance instead. For example, if I search for "a b c", I want the results that match "a b c" first, then those that match "a b" or "b c", and finally those that match only "a", "b", or "c".

Can someone suggest how to retrieve results by relevance when searching on multiple words? Thanks for your help.

Emil
Lolly
  • possible duplicate of [How to get a Token from a Lucene TokenStream?](http://stackoverflow.com/questions/2638200/how-to-get-a-token-from-a-lucene-tokenstream) – Lolly May 16 '13 at 06:45
  • If you see a post that is a duplicate, please click the "flag" button and flag it as a duplicate, and it will be closed automatically. There is no need to remove the content in the posts :) – Emil May 16 '13 at 07:02

1 Answer

By default, Lucene sorts results by text relevance only. Several factors contribute to the relevance score.

It is possible that tf-idf values and length normalization affected your scores, so that documents containing only "a b" or "b c" rank above documents containing "a b c".
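To make that concrete, here is a small stand-alone sketch of how length normalization can outweigh a full match. It does not call Lucene; it only mimics a simplified form of the classic TF-IDF formula (coord times the sum of tf * idf^2 * lengthNorm), assuming tf = sqrt(freq) and lengthNorm = 1/sqrt(field length), and ignoring queryNorm, boosts and norm encoding. The class name `ScoreSketch`, the idf values and the field lengths are all made up for illustration.

```java
// Simplified sketch of classic TF-IDF scoring, NOT Lucene's actual code.
// Assumptions: tf = sqrt(freq), lengthNorm = 1/sqrt(field length),
// idf is supplied by the caller; queryNorm and boosts are ignored.
final class ScoreSketch {

    /** Term-frequency factor: grows with the square root of the raw frequency. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Length normalization: shorter fields get a larger factor. */
    static double lengthNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    /**
     * Score of one document: coord * sum over matched terms of tf * idf^2 * lengthNorm.
     * freqs[i] is the frequency of query term i in the document (0 = not matched).
     */
    static double score(int[] freqs, double[] idfs, int fieldLength) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                overlap++;
                sum += tf(freqs[i]) * idfs[i] * idfs[i] * lengthNorm(fieldLength);
            }
        }
        // Default coord: matched query terms divided by total query terms.
        double coord = (double) overlap / freqs.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        double[] idfs = {1.0, 1.0, 1.0}; // hypothetical: a, b and c are equally rare

        // Contains a, b and c once each, inside a 100-term field.
        double longDoc = score(new int[] {1, 1, 1}, idfs, 100);
        // Contains only a and b, and nothing else (2-term field).
        double shortDoc = score(new int[] {1, 1, 0}, idfs, 2);

        System.out.println("a b c in 100-term field: " + longDoc);
        System.out.println("a b   in   2-term field: " + shortDoc);
    }
}
```

With these made-up numbers the short "a b" document scores roughly 0.94 against roughly 0.30 for the long "a b c" document, even though the default coord factor already favours the full match.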

One way to overcome this is to boost the relevance score based on the number of matching query terms. You can follow the steps below.

1) Write a custom Similarity class that extends DefaultSimilarity. If you are wondering what Similarity is: it is the class Lucene uses to hold the formulas for all the scoring factors that contribute to the score.

Tutorial : Lucene Scoring

2) Override DefaultSimilarity.coord()

The Lucene documentation explains coord() as follows:

coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time. 

3) The default implementation of coord is overlap / maxOverlap, computed as a float. You may experiment with different formulas so that documents containing more of the query words show up in the top results. The following formulas might be good starting points.

   1) coord return value = (float) Math.sqrt((double) overlap / maxOverlap)   (cast first: both arguments are ints, so a plain overlap/maxOverlap would be integer division)
   2) coord return value = overlap

4) You do NOT have to override any other methods, since DefaultSimilarity has default implementations for all scoring factors. Just touch the one you want to experiment with, which is coord() in your case. If you extend Similarity directly, you have to provide all the implementations yourself.

5) Pass your Similarity to the IndexSearcher using IndexSearcher.setSimilarity().
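The formulas from step 3 can be compared side by side before touching the index. The sketch below is plain Java and does not depend on Lucene; the class `CoordVariants` and its method names are hypothetical, and `squaredCoord` is an extra variant added here purely for illustration, not one from the steps above. In a Lucene version that still has DefaultSimilarity (3.x/4.x is assumed here), the body you pick would go into an override of coord(int overlap, int maxOverlap) in your subclass.

```java
// Stand-alone comparison of coord() formulas; it does not use Lucene.
// Assumption: in Lucene 3.x/4.x the chosen body would be placed in an
// override of coord(int overlap, int maxOverlap) on a DefaultSimilarity subclass.
final class CoordVariants {

    /** Default described above: fraction of the query terms that matched. */
    static float defaultCoord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    /** Variant 1: square root of the fraction (cast first to avoid integer division). */
    static float sqrtCoord(int overlap, int maxOverlap) {
        return (float) Math.sqrt((double) overlap / maxOverlap);
    }

    /** Variant 2: raw number of matched query terms. */
    static float countCoord(int overlap, int maxOverlap) {
        return overlap;
    }

    /** Hypothetical extra variant: squared fraction, which penalizes partial matches harder. */
    static float squaredCoord(int overlap, int maxOverlap) {
        float fraction = (float) overlap / maxOverlap;
        return fraction * fraction;
    }

    public static void main(String[] args) {
        int maxOverlap = 3; // query "a b c"
        for (int overlap = 1; overlap <= maxOverlap; overlap++) {
            System.out.printf("overlap=%d default=%.3f sqrt=%.3f count=%.1f squared=%.3f%n",
                    overlap,
                    defaultCoord(overlap, maxOverlap),
                    sqrtCoord(overlap, maxOverlap),
                    countCoord(overlap, maxOverlap),
                    squaredCoord(overlap, maxOverlap));
        }
    }
}
```

Two things are worth checking in the printed table. For a single flat query, the raw count is just the default multiplied by the constant maxOverlap, so on its own it should not reorder results for that query. The square root narrows the gap between partial and full matches, whereas a steeper function such as the squared fraction widens it. Which one works best depends on your data, so treat these as experiments rather than a recommendation.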

phanin