Lucene comparing document contents

Question

I am trying to compare the contents of documents using Solr. I do this by simply using the entire document contents as a query. This works until the documents get large: a document can contain 15k words or more, and the query then fails with a "too many boolean clauses" exception, because the maximum clause count defaults to 1024. I could of course increase this limit, but even at 5k it would remain impossible to compare documents with large contents.

Is Lucene even suitable for such tasks? If so, what should I do to meet these requirements? If not, what would be an alternative way of comparing the contents of one document with other documents?

What sort of comparison are you looking at? Is this just similarity, or word frequency, or anything else? — mindas, Jun 23 '14 at 13:52
It is indeed a similarity comparison. I want to find all documents similar to the document that I used in my query. — user3767692, Jun 23 '14 at 14:20
This has already been discussed on SO: http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene — mindas, Jun 23 '14 at 14:21
Is cosine similarity the most efficient and recommended way of comparing one document to another? Are there alternatives to explore? — user3767692, Jun 23 '14 at 14:37
Cosine similarity would increase your index size (because it needs stored term frequency vectors), in exchange for calculating similarity more quickly. As for "official" advice, I doubt there is any. — mindas, Jun 23 '14 at 14:43

score 0 · Accepted Answer · answered Jun 23 '14 at 16:13

I think MoreLikeThis is what you are looking for. MoreLikeThis prunes a document's contents down to its highest-scoring terms (frequent in the document, weighted by how rare they are in the index) and searches with just those, which gets around the high number of terms and also improves performance. If you are searching for documents similar to an external source:

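MoreLikeThis mlt = new MoreLikeThis(indexreader);
mlt.setAnalyzer(analyzer); // an Analyzer is required when the source is a Reader
Query query = mlt.like(someReader, "contents"); // Lucene 4.x form; 5.0 and later use like("contents", someReader)
TopDocs hits = indexsearcher.search(query, 10); // top 10 matches; the old Hits class was removed in Lucene 3.0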

Or if searching for a document already in the index:

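MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(documentNumber); // Lucene's internal document id, not an application-level key
TopDocs hits = indexsearcher.search(query, 10); // top 10 matches; the old Hits class was removed in Lucene 3.0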

Solr also includes a MoreLikeThis handler.
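
To make the pruning idea concrete, here is a minimal plain-Java sketch of it, with no Lucene dependency. It is not MoreLikeThis's actual implementation: the class TermPruner, the method topTerms, the tokenization and the exact idf formula are all assumptions chosen for illustration. It only shows why a 15k-word document can be reduced to a handful of query terms that stay well under the clause limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: picks the most distinctive terms of a text.
class TermPruner {

    // Returns up to maxTerms terms ranked by tf * idf, highest first.
    // docFreq maps a term to the number of indexed documents containing it and
    // numDocs is the total number of indexed documents (both are assumed inputs).
    static List<String> topTerms(String text, Map<String, Integer> docFreq,
                                 int numDocs, int maxTerms) {
        // Count how often each lower-cased token occurs in the text.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }

        // Score each term: high when frequent in this text and rare in the index.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Sort by descending score, breaking ties alphabetically for stable output.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms.subList(0, Math.min(maxTerms, terms.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 1000);   // appears in nearly every document
        docFreq.put("lucene", 12);  // rare, so more distinctive
        docFreq.put("index", 40);

        // "the" is the most frequent token but is pruned because it is so common.
        System.out.println(topTerms("the the the lucene lucene index", docFreq, 1000, 2));
        // prints [lucene, index]
    }
}
```

The selected terms would then be combined into an ordinary boolean query. With a cap such as maxTerms, the query size no longer grows with the length of the source document.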
