
I'm looking for general advice on how to measure the similarity between two running texts.

What I need is an idea or draft for an algorithm that compares two running texts and outputs how similar they are, ideally with a good runtime.

For example: text A is 90% similar to text B.

Standard checks for whether text A contains keywords and passages of text B aren't enough for my case.

I googled a lot, and the best I stumbled upon was text mining, but that's not really what I was searching for.

Does a common solution for this kind of problem exist, or do I need a more individual solution?

Update: An example. As I said, these are running texts, so a text can contain more than one or two sentences. More likely a text will contain 20-50 sentences, but here is a short example.

Text A: "Lorem ipsum dolor sit amet, consectetuer adipiscing elit." Text B: "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa."

I would rate the texts as about 40-50% similar, because Text B contains Text A in full.

The algorithm should compute this rating; a deviation of less than 10 percent is OK! ;)

But this is just a simple example for illustration. The texts I will use are sometimes not similar to each other at all!

  • I think you are looking for [near-duplicate detection](http://stackoverflow.com/q/23053688/572670), where you use the Jaccard similarity as your metric. – amit Jan 06 '16 at 21:04
  • Or, if you want something simpler, you can go with a bag-of-words model, Levenshtein distance, and more. – amit Jan 06 '16 at 21:07
  • Thanks for the replies! I added an example. Near-duplicate detection looks interesting, but sometimes the subjects of the two texts don't relate to each other at all, and I still need to measure how similar the texts are. The problem I see with near-duplicate detection is that it works well for measuring similarities between 75% and 100%, or for detecting no similarity at all, but not for the range between 0% and 75%. I need measurements in the lower percentage range too. I've read about Levenshtein distance; it is a quick way to compare two words, but sadly it doesn't work with full texts. – El-Presidente Jan 06 '16 at 21:21
  • @El-Presidente You can easily adapt it by defining a "special" alphabet in which each unique word is a character. Then you basically have two strings that you can compare. Another option is [semantic relatedness](http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf), but that might require a lot of work to implement. – amit Jan 06 '16 at 21:26
  • @amit Thanks for the response and link! Looks very fine to me ;) – El-Presidente Jan 06 '16 at 22:07
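
The two suggestions from the comments (Levenshtein distance over a word "alphabet", and Jaccard similarity) can be sketched roughly as follows. This is only a minimal illustration under some assumptions that are not from the thread: the function names are made up, the tokenizer is a simple lowercase regex split, the edit distance is normalized by the length of the longer text, and Jaccard is computed on plain word sets rather than shingles.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def word_levenshtein(a, b):
    """Edit distance between two token lists, treating each word as one symbol."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def levenshtein_similarity(text_a, text_b):
    """Similarity in [0, 1]; normalizing by the longer text is an assumption."""
    a, b = tokenize(text_a), tokenize(text_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - word_levenshtein(a, b) / longest


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the two word sets (word order is ignored)."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


text_a = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
text_b = ("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. "
          "Aenean commodo ligula eget dolor. Aenean massa.")

print(round(levenshtein_similarity(text_a, text_b), 2))  # 0.53
print(round(jaccard_similarity(text_a, text_b), 2))      # 0.62
```

On the example from the question, the word-level Levenshtein score lands near the 40-50% estimate, because Text A's 8 words need 7 inserted words to become Text B's 15. The dynamic-programming loop costs time proportional to the product of the two word counts, which should be acceptable for texts of 20-50 sentences.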
