1

I need to get the Vector Space Model (with tf-idf weighting) from the results of a Lucene query, and can't figure out how to do it. It seems like it should be simple, and at this stage maybe one of you can point me in the right direction.

I have been trying to figure out how to do this for a good while, and either I haven't yet understood how what I have read applies to my problem (more than likely), or a solution to my particular problem hasn't been posted. I even tried computing the VSM myself directly from the query results, but my solution has hideous complexity.

Edit: For anyone else who stumbles upon this, there is a solution at the much clearer question here. What I need can be obtained with the IndexReader.getTermFreqVector(int docNumber, String field) method.

Unfortunately this doesn't work for me, as the index I am working from wasn't built with term frequency vectors stored, so I guess I'm still looking for more help on this!

Mark

3 Answers

3

To answer this question: there is no way to get the VSM for a set of results directly in Lucene, but you can compute a TF-IDF weighted vector space model for a set of Lucene results yourself using the IndexReader.getTermFreqVector() and Searcher.docFreq() methods. The first gives the term frequencies for a document, and the second gives the number of documents containing a term.
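As a rough illustration of the weighting step, here is a minimal sketch in plain Java. It assumes the per-document term counts and the document frequencies have already been pulled out of the index (for example via the two methods above) into ordinary maps. The class name `TfIdfSketch`, the method `tfIdfVector`, and the textbook weight `tf * ln(N / df)` are my own assumptions for illustration; Lucene's built-in scoring uses a somewhat different formula.

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {

    /**
     * Builds a sparse TF-IDF vector for one document.
     *
     * termFreqs: term -> raw count in this document (hypothetical input,
     *            e.g. copied out of a term frequency vector)
     * docFreqs:  term -> number of documents containing the term
     * numDocs:   total number of documents in the index
     */
    public static Map<String, Double> tfIdfVector(Map<String, Integer> termFreqs,
                                                  Map<String, Integer> docFreqs,
                                                  int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(entry.getKey(), 0);
            if (df == 0) {
                continue; // no document frequency known for this term; skip it
            }
            // Textbook idf (an assumption, not Lucene's exact formula).
            double idf = Math.log((double) numDocs / df);
            vector.put(entry.getKey(), entry.getValue() * idf);
        }
        return vector;
    }
}
```

With this weighting, a term that occurs in every document gets a weight of zero, which is usually what you want for stop-word-like terms.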

Mark
2

Maybe I'm misunderstanding what you're trying to do, but Lucene's scoring already uses the vector space model. If you want more detail on how the score is calculated for a given document and query, use Searcher.explain(Query query, int doc).

bajafresh4life
  • Submit the text of each document as the query, and you'll get the cosine similarity for that document with every other document in your index. When you transform the text of the document into a query, make sure each term is an OR term. – bajafresh4life Jul 29 '10 at 16:25
1

If I understand correctly from your comment, you want to compute the VSM cosine similarity between documents rather than between a query and a document. I don't know exactly how to do this, but I'd point you to the Lucene API page for the Similarity class. You'd probably have to derive and use a custom subclass of Similarity that changes the coord and queryNorm methods, and find a way to turn documents into query objects.

(No guarantees; I'm just trying to figure out this scoring myself.)
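For reference, the cosine similarity itself is simple to compute outside of Lucene once each document is available as a sparse term-weight vector. The following is only a sketch in plain Java, not a Similarity subclass; the class name `CosineSketch` and the map-based vector representation are assumptions made for illustration.

```java
import java.util.Map;

public class CosineSketch {

    /** Cosine similarity between two sparse term -> weight vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat an empty or all-zero vector as dissimilar
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

Comparing every pair of documents this way is quadratic in the number of documents, so it is only practical for a modest result set.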

Fred Foo
  • Yep, that's what I'm looking for. I'll have a fresh look at the Similarity class. Thanks for your help. – Mark Jul 29 '10 at 13:29