I need to calculate how much of one block of text (A) is contained in another block of text (B). Simple algorithms like Soundex aren't giving me good results, because text B contains additional text that isn't, and shouldn't be, in text A, which throws my figures off. I need to ensure that a certain percentage of A appears within B while ignoring the additions to B.

My first thought for a simple algorithm that might work in my case is to split A into sentences, note the total number of sentences, then search B for an instance of each sentence and report the share that were found as a percentage. While this should work, it feels quite hacky, and I'm sure someone more intelligent than I has devised an algorithm that gives a better calculation on a similar principle.
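
For reference, a minimal sketch of that sentence-matching idea in Python. The function name `sentence_coverage`, the naive regex splitter, and the whitespace normalisation are all assumptions made for illustration; real text would likely need a proper sentence tokenizer.

```python
import re


def sentence_coverage(a: str, b: str) -> float:
    """Return the fraction of sentences in `a` that occur verbatim in `b`."""
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", a) if s.strip()]
    if not sentences:
        return 0.0
    # Collapse runs of whitespace so line breaks in B do not block a match.
    haystack = " ".join(b.split())
    found = sum(1 for s in sentences if " ".join(s.split()) in haystack)
    return found / len(sentences)


a = "The cat sat. The dog ran. Birds fly."
b = "Intro text. The cat sat. Something else. Birds fly. The end."
print(f"{sentence_coverage(a, b):.0%} of A's sentences were found in B")
```

Because the match is exact, a single changed word makes a whole sentence count as missing, which is the main weakness of this approach.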

Phillip B Oldham
  • Try [diff match patch](https://code.google.com/p/google-diff-match-patch/)? – Abhinav Sarkar May 03 '13 at 09:01
  • There is a whole field devoted to this; it is called [Plagiarism detection](http://en.wikipedia.org/wiki/Plagiarism_detection) – oleksii May 03 '13 at 09:03
  • Locality Sensitive Hashing might be overkill, but you can get ideas from it. http://en.wikipedia.org/wiki/Locality-sensitive_hashing – anoopelias May 03 '13 at 12:29

1 Answer

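Longest Common Subsequence (LCS) looks best suited to your purposes. The length of the LCS of A and B, divided by the length of A, gives the fraction of A that appears in order within B, and the extra text in B does not lower that figure.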
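
A minimal sketch of how an LCS-based percentage could be computed in Python. Comparing whitespace-separated words rather than characters, and the helper names `lcs_length` and `lcs_coverage`, are assumptions made for illustration, not something specified in the question.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    # Standard dynamic programme, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_coverage(a: str, b: str) -> float:
    """Fraction of A's words that appear, in order, somewhere in B."""
    a_words = a.split()  # word-level tokens (an assumption)
    b_words = b.split()
    if not a_words:
        return 0.0
    return lcs_length(a_words, b_words) / len(a_words)


a = "the quick brown fox jumps"
b = "today the quick red brown fox happily jumps away"
print(f"{lcs_coverage(a, b):.0%} of A is covered by B")
```

The table takes time proportional to len(A) × len(B), so for very long texts it may be worth comparing sentences or larger chunks instead of single words.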

Begelfor