How to calculate "OnTopicness" of documents using Lucene.NET

Question

Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, that serve as my Lucene documents.

Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. This feature is essentially a cosine similarity between two document vectors. The result will be stored in the database, so it has to be calculated only once per post. There are two variants:

Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread

Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as a query and search for the other document in the results, or manually calculate the similarity using TermFreqVectors and DocFrequencies.

I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know if I just pass two documents to my GetCosineSimilarity method:

public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
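For the "how would I continue here?" step, a minimal sketch of the cosine computation itself might look like the following. It assumes each term vector has already been read out as a parallel array of terms and an array of frequencies (in Lucene.NET 2.9 that would presumably come from TermFreqVector.GetTerms() and GetTermFrequencies(), which should be checked against the version in use). The class name CosineSketch and both method names are hypothetical, and the weights are raw term frequencies only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class CosineSketch
{
    // Turns parallel term/frequency arrays into a term -> weight map.
    public static Dictionary<string, double> ToVector(string[] terms, int[] freqs)
    {
        var vector = new Dictionary<string, double>();
        for (int i = 0; i < terms.Length; i++)
        {
            vector[terms[i]] = freqs[i];
        }
        return vector;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
    public static double Cosine(IDictionary<string, double> a, IDictionary<string, double> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;

        foreach (var entry in a)
        {
            normA += entry.Value * entry.Value;

            // Only terms present in both vectors contribute to the dot product.
            double other;
            if (b.TryGetValue(entry.Key, out other))
            {
                dot += entry.Value * other;
            }
        }

        foreach (var entry in b)
        {
            normB += entry.Value * entry.Value;
        }

        if (normA == 0.0 || normB == 0.0)
        {
            return 0.0;
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

To use tf-idf weights instead of raw counts, each frequency could be multiplied by an inverse document frequency derived from the index's document frequencies before calling Cosine. The similarity function itself would stay the same.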

Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create a separate index for them and somehow compare my post to this entire index?
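One possible alternative to concatenating text is to build the virtual document as a summed term-frequency vector: add up the term counts of every post in the thread or forum once, then subtract a post's own counts to get "all other posts" for that post. The sketch below illustrates this under the assumption that each post's terms and frequencies are available as parallel arrays. VirtualDocumentSketch, AddPost and WithoutPost are hypothetical names, not Lucene.NET API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class VirtualDocumentSketch
{
    // Adds one post's term counts into the running total for a thread or forum.
    public static void AddPost(Dictionary<string, int> total, string[] terms, int[] freqs)
    {
        for (int i = 0; i < terms.Length; i++)
        {
            int current;
            total.TryGetValue(terms[i], out current); // current stays 0 if the term is new
            total[terms[i]] = current + freqs[i];
        }
    }

    // Returns a copy of the total with one post's counts removed,
    // i.e. the vector of "all other posts". The total itself is not modified.
    public static Dictionary<string, int> WithoutPost(Dictionary<string, int> total, string[] terms, int[] freqs)
    {
        var rest = new Dictionary<string, int>(total);
        for (int i = 0; i < terms.Length; i++)
        {
            int current;
            if (!rest.TryGetValue(terms[i], out current))
            {
                continue;
            }

            int remaining = current - freqs[i];
            if (remaining > 0)
            {
                rest[terms[i]] = remaining;
            }
            else
            {
                // The term occurred only in this post, so it drops out entirely.
                rest.Remove(terms[i]);
            }
        }
        return rest;
    }
}
```

With this layout the aggregate is computed once per thread or forum, and each post only needs a subtraction before its similarity is calculated. Copying the full dictionary for every post may be too costly for forum-sized vocabularies, so this is a starting point rather than a tuned design.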

As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!

score 0 · Answer 1 · answered Sep 06 '11 at 14:52


Take a look at MoreLikeThisQuery in https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/

Its source may be useful.

answered Sep 06 '11 at 14:52

guest


Thanks for the answer. Could you please explain a little further what exactly I should take from the source to achieve my goal? As far as I understand, the MoreLikeThis query can extract and score important terms from my document based on the entire index. I am still not sure how to structure and compare two documents, though. – Shackles Sep 09 '11 at 11:38

score 0 · Answer 2 · answered Sep 07 '11 at 20:52


Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.

answered Sep 07 '11 at 20:52

kc2001


Unfortunately, S-Space is a Java implementation, which is not an option for me since I am working in a .NET-only environment. – Shackles Sep 09 '11 at 11:39
