Is there a built-in algorithm to find the similarity between two documents in Lucene? When I went through the DefaultSimilarity class, it gives a score as the result of comparing a query and a document.

I have already indexed my documents and used the Snowball analyzer; the next step would be to find the similarity between the two documents.

Can somebody suggest a solution ?

CTsiddharth

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene – Mikos Feb 16 '12 at 21:07

1 Answer

There does not seem to be a built-in algorithm. I believe there are three ways you can go about this:

a) Run a MoreLikeThis query on one of the documents, iterate through the results, check for the doc id, and get the score. This may not be pretty: you might need to return a lot of documents for the one you are interested in to be among the returned ones.
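For illustration, a rough sketch of option a), assuming Lucene 4.x, an open IndexReader reader and IndexSearcher searcher, a field named "content", and firstDocId/otherDocId as the two documents' internal ids (all names are placeholders):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "content" });
mlt.setMinTermFreq(1);  // loosen the default thresholds so short documents still produce query terms
mlt.setMinDocFreq(1);
Query query = mlt.like(firstDocId);           // build a "more like this" query from the first document
TopDocs hits = searcher.search(query, 100);   // you may need a large result count
for (ScoreDoc sd : hits.scoreDocs) {
    if (sd.doc == otherDocId) {
        float score = sd.score;               // score of the other document relative to the first
    }
}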

b) Cosine similarity: the answers at the link Mikos provided in his comment explain how cosine similarity can be computed for two documents.
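As a sketch of option b) built on term vectors (this assumes the field was indexed with term vectors; the cosine formula itself is standard math, not a Lucene API):

// Build a term -> frequency map from one document's term vector
Map<String, Long> tf = new HashMap<String, Long>();
Terms vector = reader.getTermVector(docId, "content");  // null if no term vector was stored
TermsEnum termsEnum = vector.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
    // within a term vector, totalTermFreq() is the frequency in this document
    tf.put(term.utf8ToString(), termsEnum.totalTermFreq());
}

// With maps tfA and tfB built this way for the two documents:
double dot = 0, normA = 0, normB = 0;
for (Map.Entry<String, Long> e : tfA.entrySet()) {
    normA += (double) e.getValue() * e.getValue();
    Long f = tfB.get(e.getKey());
    if (f != null) dot += (double) e.getValue() * f;
}
for (Long f : tfB.values()) normB += (double) f * f;
double cosine = dot / (Math.sqrt(normA) * Math.sqrt(normB));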

c) Compute your own Lucene similarity score. The Lucene score adds a few factors to cosine similarity (see http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).

You can use

DefaultSimilarity ds = new DefaultSimilarity();
// stats is the per-term weight and arc the reader context, both obtained below
SimScorer scorer = ds.simScorer(stats, arc);
scorer.score(otherDocId, freq);

You can get the stats and arc parameters, for example, through

// leaves() is an instance method on your IndexReader
AtomicReaderContext arc = reader.leaves().get(0);
SimWeight stats = ds.computeWeight(1, collectionStats, termStats);
stats.normalize(1, 1);

where in turn you can get the term stats using the term vector of the first of your two documents, and your IndexReader for the collection stats (see the loop sketch further down). To get the freq parameter, use

DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, field, term);

then iterate through the docs until you find the doc id of your first document, and do

freq = docsEnum.freq();
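Put together, the frequency lookup might look like this (a sketch; advance() jumps to the first document with an id >= the target):

int freq = 0;
// getTermDocsEnum can return null if the term does not occur in the field
if (docsEnum != null && docsEnum.advance(firstDocId) == firstDocId) {
    freq = docsEnum.freq();  // frequency of the term in your first document
}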

Note that you need to call scorer.score for each term (or each term you want to consider) in your first document and sum up the results.
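A condensed sketch of that main loop, following the calls above (it reuses ds, arc, reader, firstDocId and otherDocId from the earlier snippets, assumes a field "content", and exact method names may vary slightly between 4.x releases):

CollectionStatistics collectionStats = new CollectionStatistics("content",
        reader.maxDoc(), reader.getDocCount("content"),
        reader.getSumTotalTermFreq("content"), reader.getSumDocFreq("content"));
Terms thisTV = reader.getTermVector(firstDocId, "content");  // term vector of the first document
float score = 0f;
float sumWeights = 0f;
TermsEnum te = thisTV.iterator(null);
BytesRef termBytes;
while ((termBytes = te.next()) != null) {
    Term term = new Term("content", BytesRef.deepCopyOf(termBytes));
    TermStatistics termStats = new TermStatistics(term.bytes(),
            reader.docFreq(term), reader.totalTermFreq(term));
    SimWeight stats = ds.computeWeight(1, collectionStats, termStats);
    sumWeights += stats.getValueForNormalization();  // needed later for queryNorm
    stats.normalize(1, 1);
    SimScorer scorer = ds.simScorer(stats, arc);
    DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, "content", term.bytes());
    if (docsEnum != null && docsEnum.advance(firstDocId) == firstDocId) {
        score += scorer.score(otherDocId, docsEnum.freq());
    }
}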

Finally, to multiply in the queryNorm and coord factors, you can use

// sumWeights was computed in the main loop while iterating over the
// first term vector, by summing up stats.getValueForNormalization()
float queryNorm = ds.queryNorm(sumWeights);
// thisTV and otherTV are the term vectors of the two documents;
// overlap (the number of terms they share) can be calculated easily
float coord = ds.coord(overlap, (int) Math.min(thisTV.size(), otherTV.size()));
return coord * queryNorm * score;

So this is a way that should work. It is not elegant, and because of the difficulty of getting term frequencies (iterating over a DocsEnum for each term), it is not very efficient either. I still hope some of this helps someone :)

Chris Paul