Measuring similarity between document sets

Question

For illustration purposes, let's assume this is a forum service. I need to calculate the "similarity" among each user's posts, so that the result would be something like:

among posts by user A, similarity 60%
among posts by user B, similarity 20%
...

I'm dealing with multibyte strings, so I guess I'm stuck with search engines here. We already use Solr and have moreLikeThis implemented, but I'm not quite sure how to construct the query. Any help is appreciated!

You need to define what you consider "similar" and how you want to model it. Levenshtein distance? Markov Chains? — Kajetan Abt, May 20 '11 at 09:34
Actually I don't really care, in the sense that I'm willing to let Solr's moreLikeThis feature decide for me. But instead of the standard "get me more articles like this one, based on that similarity scoring thing you do", what I'm trying to do here is "get me the similarity score among these articles". — jodeci, May 23 '11 at 01:51

score 1 · Answer 1 · answered Sep 15 '11 at 19:09

Possibly Carrot2 will interest you (and this blog related to it)

Omnaest

score 0 · Answer 2 · answered Jul 27 '11 at 20:30

This is a strange question in two ways: 1. Why do you have to use Solr? 2. The right kind of similarity depends on the target problem, and your question sounds too generic to me. There is ongoing research in the area of semantic similarity. There is also the edit-distance algorithm, which is probably not what you want.

So, define your question more precisely and you will get better answers.

score 0 · Answer 3 · answered Dec 09 '11 at 05:18

There are several measures of similarity. A simple and effective one is cosine similarity. There are also more sophisticated ones, such as Smith-Waterman.

Look at http://sourceforge.net/projects/simmetrics/
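
To make the cosine suggestion concrete, here is a minimal sketch of one way to get a single "similarity among posts" number per user: average the pairwise cosine similarity of that user's posts. It is plain Python, not the SimMetrics API and not Solr's moreLikeThis scoring. The function names (char_ngrams, cosine, average_pairwise_similarity) and the sample posts are hypothetical, and the use of character bigrams is an assumption made so that multibyte text works without a language-specific tokenizer.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (assumed feature choice for multibyte text)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) < n:
        # Too short for an n-gram: use the whole string as one feature, if any.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(posts, n=2):
    """Mean cosine similarity over every pair of posts; 0.0 if fewer than two posts."""
    vectors = [char_ngrams(post, n) for post in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sample data, only to show the shape of the result.
posts_by_user = {
    "A": ["今日は天気がいいですね", "今日は天気が悪いですね", "明日は天気がいいですね"],
    "B": ["Solr の設定について", "週末に映画を見ました", "新しい自転車を買った"],
}

for user, posts in posts_by_user.items():
    score = average_pairwise_similarity(posts)
    print(f"among posts by user {user}, similarity {score:.0%}")
```

The number of pairs grows quadratically with the number of posts per user, so for large histories you would probably want to sample pairs or compare each post against a per-user centroid vector instead.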

Mikos
