I am trying to compare the contents of documents using Solr, by using a document's entire contents as the query. This works until the documents get large: a document can contain 15k words or more, which triggers the max boolean clause exception (the limit defaults to 1024 clauses). I could of course raise the limit, but even at 5k it would still be impossible to compare the largest documents.

Is Lucene even suitable for such a task? If so, how should I go about it? If not, what would be an alternative way of comparing the contents of one document with those of other documents?

user3767692
  • What sort of comparison are you looking at? Is this just similarity, or word frequency, or anything else? – mindas Jun 23 '14 at 13:52
  • It is indeed a similarity comparison. I want to find all documents similar to the document that I used in my query. – user3767692 Jun 23 '14 at 14:20
  • This has already been discussed on SO: http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene – mindas Jun 23 '14 at 14:21
  • Is cosine similarity the most efficient and recommended way of comparing one document to another? Are there alternatives to explore? – user3767692 Jun 23 '14 at 14:37
  • Cosine similarity would increase your index size (because term frequency vectors need to be stored), in exchange for calculating similarity more quickly. As for "official" advice, I doubt there's any. – mindas Jun 23 '14 at 14:43
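
For reference, the cosine similarity mentioned in the comments can be sketched in plain Java over term-frequency maps. This is only an illustration of the measure itself: `CosineSketch` and its methods are hypothetical helpers, not Lucene or Solr API, and the whitespace tokenization stands in for whatever analyzer the index would really use.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class CosineSketch {

    // Counts how often each lower-cased, whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-frequency vectors;
    // returns 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue().doubleValue() * other.doubleValue();
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("solr compares documents");
        Map<String, Integer> d2 = termFrequencies("lucene compares documents");
        System.out.println(cosine(d1, d2));
    }
}
```

Computing this for every pair of documents at query time is expensive, which is why the comment above refers to storing term vectors in the index.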

1 Answer

I think MoreLikeThis is what you are looking for. MoreLikeThis prunes a document's contents down to its most significant terms (high-frequency terms, weighted by TF-IDF) and searches with just those, which gets around the high number of terms and improves performance. If you are searching for documents similar to an external source:

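MoreLikeThis mlt = new MoreLikeThis(indexreader);
// An analyzer is required when the source is a Reader; "analyzer" is assumed to be the one used to index "contents".
mlt.setAnalyzer(analyzer);
Query query = mlt.like(someReader, "contents");
// The Hits class was removed in Lucene 3.0; search now takes a maximum result count (10 here) and returns TopDocs.
TopDocs hits = indexsearcher.search(query, 10);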

Or if searching for a document already in the index:

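MoreLikeThis mlt = new MoreLikeThis(indexreader);
// documentNumber is Lucene's internal document id, not a value from one of your own fields.
Query query = mlt.like(documentNumber);
// The Hits class was removed in Lucene 3.0; search now takes a maximum result count (10 here) and returns TopDocs.
TopDocs hits = indexsearcher.search(query, 10);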
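
To make the pruning idea concrete, here is a minimal plain-Java sketch of selecting the highest-scoring terms from a document before building a query. It is not Lucene's actual implementation: `TermPruningSketch`, its method names, and the inputs (per-document term counts, index-wide document frequencies, total document count) are hypothetical, and the weighting is a simple classic-style TF-IDF chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of term pruning; not Lucene's MoreLikeThis code.
final class TermPruningSketch {

    // Simple TF-IDF weight: term frequency times (1 + ln(numDocs / (docFreq + 1))).
    static double score(int termFreq, int docFreq, int numDocs) {
        return termFreq * (1.0 + Math.log(numDocs / (docFreq + 1.0)));
    }

    // Returns at most maxTerms terms, highest score first (ties broken alphabetically).
    static List<String> topTerms(Map<String, Integer> termFreq,
                                 Map<String, Integer> docFreq,
                                 int numDocs,
                                 int maxTerms) {
        List<String> terms = new ArrayList<>(termFreq.keySet());
        terms.sort((x, y) -> {
            double sx = score(termFreq.get(x), docFreq.getOrDefault(x, 0), numDocs);
            double sy = score(termFreq.get(y), docFreq.getOrDefault(y, 0), numDocs);
            int byScore = Double.compare(sy, sx); // descending by score
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 10);
        tf.put("a", 8);
        tf.put("lucene", 3);
        tf.put("solr", 2);

        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000);
        df.put("a", 1000);
        df.put("lucene", 5);
        df.put("solr", 3);

        // Only the two best terms become query clauses, however long the document is.
        System.out.println(topTerms(tf, df, 1000, 2));
    }
}
```

Because the number of selected terms is capped, the resulting query stays well below the boolean clause limit regardless of document length.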

Solr also includes a MoreLikeThis handler.
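
As a rough sketch of what a request to that handler might look like, the plain-Java snippet below builds a query URL. Several things here are assumptions rather than facts about your setup: the handler is assumed to be registered at `/mlt` in solrconfig.xml, and the core URL, the `id:DOC42` query, the `contents` field and the threshold values are placeholders. The parameter names (`q`, `mlt.fl`, `mlt.mintf`, `mlt.mindf`, `rows`) are Solr's MoreLikeThis parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it only builds a URL and sends nothing.
final class MltRequestSketch {

    // URL-encodes a single query-string component as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds a MoreLikeThis handler URL for a document that is already indexed.
    static String buildMltUrl(String coreUrl, String docQuery, String field, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", docQuery);     // selects the source document
        params.put("mlt.fl", field);   // field(s) to take terms from
        params.put("mlt.mintf", "2");  // example value: ignore terms rarer than this in the source
        params.put("mlt.mindf", "5");  // example value: ignore terms in fewer documents than this
        params.put("rows", Integer.toString(rows));

        StringBuilder url = new StringBuilder(coreUrl).append("/mlt?");
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(p.getKey())).append('=').append(encode(p.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            buildMltUrl("http://localhost:8983/solr/collection1", "id:DOC42", "contents", 10));
    }
}
```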

femtoRgon