
I used tf-idf to calculate the cosine similarity between two documents, but it has some limitations and does not perform very well.
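
For context, a minimal sketch of the tf-idf baseline I mean, assuming scikit-learn (my actual implementation may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc1 = "Document with some text"
doc2 = "Other document with some text"

# Build tf-idf vectors for both documents and compare them
vectors = TfidfVectorizer().fit_transform([doc1, doc2])
print(cosine_similarity(vectors[0], vectors[1])[0, 0])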

I looked into LDA (Latent Dirichlet Allocation) to calculate document similarity, but I don't know much about it and couldn't find much material on my problem either.

Can you point me to a tutorial related to my problem, or give me some advice on how I can achieve this task with LDA?

Thanks

P.S.: Is there any source code available to perform such a task with LDA?

user238384

4 Answers


Have you had a look at Lucene and Mahout?

This might be useful - Latent Dirichlet Allocation with Lucene and Mahout.

Binary Nerd
  • Thanks. Can you also tell me whether it is possible to calculate the similarity between two documents with the help of LDA? Most people say it is used for unsupervised clustering :( – user238384 Feb 17 '10 at 02:06
  • Sorry, I don't know enough about LDA to offer an expert answer to that; it's not a part of Mahout that I've used. However, my understanding of clustering is that you're grouping objects based on some similarity measure, which in this case would be LDA. – Binary Nerd Feb 17 '10 at 02:50
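
A minimal sketch of how document similarity with LDA could look, assuming the gensim library (not mentioned in this thread): represent each document by its topic distribution and compare the distributions.

from gensim import corpora, models, matutils

# Toy training corpus; a real one would need many more documents
train_docs = [
    "human machine interface for lab computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
    "the generation of random binary unordered trees".split(),
    "the intersection graph of paths in trees".split(),
]
dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]

# Train a small LDA model (num_topics is a guess; tune it for real data)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Represent two new documents as dense topic distributions
bow1 = dictionary.doc2bow("human computer interface".split())
bow2 = dictionary.doc2bow("random binary trees".split())
dist1 = matutils.sparse2full(lda[bow1], lda.num_topics)
dist2 = matutils.sparse2full(lda[bow2], lda.num_topics)

# Hellinger distance between the distributions: 0 means identical topics
print(matutils.hellinger(dist1, dist2))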

A bit old, but for anyone still interested, take a look at this blog post (disclaimer: this is my own blog). The algorithm described there and the linked code will probably do what you need if you don't have your heart set on any specific approach.

Regarding Shashikant's comment, cosine similarity may not be a good option because the signatures are proportional in length to the documents; constant-length signatures are preferable.
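
The algorithm from the blog post is not reproduced here, but as an illustration of what a constant-length signature buys you, here is a minimal MinHash sketch (my own toy example, not the blog's code): the signature size is fixed regardless of document length, and the fraction of matching entries estimates the Jaccard similarity of the token sets.

import hashlib

def minhash_signature(tokens, num_perm=64):
    # One entry per simulated permutation, via a salted hash; the
    # signature length is num_perm no matter how long the document is
    unique = set(tokens)
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in unique)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching positions estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "document with some text".split()
doc2 = "other document with some text".split()
print(estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2)))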

user1417684

Try this service for computing the cosine similarity between two documents:

http://www.scurtu.it/documentSimilarity.html

import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"

# Form-encode the two documents to compare
params = urllib.parse.urlencode({
    'doc1': 'Document with some text',
    'doc2': 'Other document with some text',
}).encode('utf-8')

# POST to the API and parse the JSON response
with urllib.request.urlopen(API_URL, params) as f:
    response_object = json.loads(f.read().decode('utf-8'))

print(response_object)

You might be thinking of LSA (Latent Semantic Analysis), which is a very common solution to this kind of problem.

Pace
  • Hi Pace, thanks for your reply. Yes, I know about LSA and have implemented it. I used the JAMA package for SVD, but it doesn't work when the matrix has fewer rows than columns :( Can you suggest any other small SVD package? – user238384 Feb 17 '10 at 02:08
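
On the SVD package question in the comment above: one option (an assumption, not something suggested in the thread) is scikit-learn's TruncatedSVD, which works with fewer rows than columns. A minimal LSA sketch:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Document with some text",
    "Other document with some text",
    "Completely unrelated content here",
]

# Term-document matrix (rows = documents, columns = terms), then a
# low-rank projection into a small "concept" space
tfidf = TfidfVectorizer().fit_transform(docs)
concepts = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Compare the first two documents in the reduced space
print(cosine_similarity(concepts[0:1], concepts[1:2])[0, 0])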