Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums. These posts serve as my Lucene documents.

Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. In fact, this feature is little more than a cosine similarity between two document vectors. The result will be stored in the database, so it has to be calculated only once per post. There are two variants:

  • Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
  • Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread
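
Written out, both variants are the usual cosine between two term-weight vectors. The symbols here are only illustrative: p is the vector of my post, d the vector of the virtual document, and p_t, d_t the weight of term t in each. Whether the weights are raw term frequencies or tf-idf values is an assumption still to be decided.

```latex
\[
\mathrm{OnTopicness}(p, d)
  = \cos(\vec{p}, \vec{d})
  = \frac{\sum_{t} p_t \, d_t}
         {\sqrt{\sum_{t} p_t^{2}} \; \sqrt{\sum_{t} d_t^{2}}}
\]
```

The sums run over all terms; only terms that occur in both documents contribute to the numerator.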

Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I have read about two workarounds: either parse one of the documents as a query and look for the other document in the search results, or calculate the similarity manually using TermFreqVectors and document frequencies.

I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know when I only pass two Document objects to my GetCosineSimilarity method:

public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
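
To make the second comment concrete, here is a minimal sketch of the arithmetic once both term vectors are available. It assumes the terms and frequencies have already been pulled out as parallel arrays (the shape that TermFreqVector.GetTerms() and GetTermFrequencies() return) and it uses raw term counts without any IDF weighting. The class and method names are hypothetical, and nothing here touches the index.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSketch
{
    // terms/freqs are parallel arrays: freqs[i] is the count of terms[i] in the field.
    public static double Cosine(string[] terms1, int[] freqs1,
                                string[] terms2, int[] freqs2)
    {
        // Index the first vector by term and accumulate its squared length.
        var weights1 = new Dictionary<string, int>();
        double normSq1 = 0.0;
        for (int i = 0; i < terms1.Length; i++)
        {
            weights1[terms1[i]] = freqs1[i];
            normSq1 += (double)freqs1[i] * freqs1[i];
        }

        // Walk the second vector: squared length plus dot product over shared terms.
        double dot = 0.0, normSq2 = 0.0;
        for (int i = 0; i < terms2.Length; i++)
        {
            normSq2 += (double)freqs2[i] * freqs2[i];
            int f1;
            if (weights1.TryGetValue(terms2[i], out f1))
                dot += (double)f1 * freqs2[i];
        }

        // An empty vector has no direction; treat the similarity as 0.
        if (normSq1 == 0.0 || normSq2 == 0.0) return 0.0;
        return dot / (Math.Sqrt(normSq1) * Math.Sqrt(normSq2));
    }
}
```

Swapping the raw counts for tf-idf weights would only change the numbers fed into the two loops, not the structure.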

Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create a separate index for them and somehow compare my post to this entire index?
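
One possible reading of the "virtual document", sketched under assumptions: with raw term counts and the same analyzer, summing the per-post term counts gives the same vector as counting terms in the concatenated bodies. The vector for "all other posts" could then be the thread or forum total minus the counts of the post in question. The names below are hypothetical, and the inputs are again parallel term/frequency arrays.

```csharp
using System.Collections.Generic;

public static class VirtualDocumentSketch
{
    // Adds one post's term counts into the running total for a thread or forum.
    public static void AddCounts(Dictionary<string, int> total,
                                 string[] terms, int[] freqs)
    {
        for (int i = 0; i < terms.Length; i++)
        {
            int current;
            total.TryGetValue(terms[i], out current); // current stays 0 if absent
            total[terms[i]] = current + freqs[i];
        }
    }

    // Returns a copy of the total with one post's counts subtracted again,
    // i.e. the "all other posts" vector for that post.
    public static Dictionary<string, int> WithoutPost(Dictionary<string, int> total,
                                                      string[] terms, int[] freqs)
    {
        var rest = new Dictionary<string, int>(total);
        for (int i = 0; i < terms.Length; i++)
        {
            int current;
            if (!rest.TryGetValue(terms[i], out current)) continue;
            int left = current - freqs[i];
            if (left > 0) rest[terms[i]] = left;
            else rest.Remove(terms[i]);
        }
        return rest;
    }
}
```

Copying the total for every post is wasteful for a large forum; the sketch favours clarity over speed.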

As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!

Shackles

2 Answers

Take a look at MoreLikeThisQuery in https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/

Its source may be useful.

guest
  • Thanks for the answer. Could you please explain a little further what exactly I should take from the source to achieve my goal? As far as I understand, the MoreLikeThis query can extract and score important terms from my document based on the entire index. I am still not sure how to structure and compare two documents, though. – Shackles Sep 09 '11 at 11:38

Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.

kc2001
  • Unfortunately, S-Space is a Java implementation which is not an option for me since I am working in a .NET only environment. – Shackles Sep 09 '11 at 11:39