Algorithm to calculate how much of text A is in text B?

Question

I need to calculate how much of a block of text (A) is in another block of text (B). Simple algorithms like Soundex aren't giving me good results because text B contains additional text that isn't (and shouldn't be) in text A, which throws my figures off. I need to ensure a certain percentage of A is within B, and ignore the additions to B.

My first thought for a simple algorithm that might work well in my case would be to split A into sentences, note the total number of sentences, then search B for an instance of each sentence to provide a percentage. While this should work it feels quite hacky, and I'm sure someone more intelligent than I has devised an algorithm to provide a better calculation on a similar principle.
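For reference, the sentence-matching idea above could be sketched roughly as follows. This is only an illustration under some assumptions: sentences are taken to end in `.`, `!` or `?`, case and whitespace differences are ignored, and the function names (`split_sentences`, `sentence_coverage`) are made up for the example.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sentence_coverage(a, b):
    """Percentage of A's sentences that occur verbatim (after normalization) in B."""
    sentences = split_sentences(a)
    if not sentences:
        return 0.0
    haystack = normalize(b)
    found = sum(1 for s in sentences if normalize(s) in haystack)
    return 100.0 * found / len(sentences)
```

Extra text in B does not lower the score here, since only A's sentences are counted. The weakness is that a single changed word makes the whole sentence count as missing.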

Try [diff match patch](https://code.google.com/p/google-diff-match-patch/)? — Abhinav Sarkar, May 03 '13 at 09:01
There is a whole branch doing this, it is called [Plagiarism detection](http://en.wikipedia.org/wiki/Plagiarism_detection) — oleksii, May 03 '13 at 09:03
Locality Sensitive Hashing might be an overkill, but you can get ideas from it. http://en.wikipedia.org/wiki/Locality-sensitive_hashing — anoopelias, May 03 '13 at 12:29

score 0 · Answer 1 · answered May 15 '13 at 19:44


Longest Common Subsequence looks best suited to your purposes.
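A minimal sketch of how that might be turned into a percentage, assuming the texts are compared word by word (split on whitespace, case-insensitive) and that "how much of A is in B" means the LCS length divided by the number of words in A. The names `lcs_length` and `lcs_coverage` are illustrative only.

```python
def lcs_length(xs, ys):
    """Length of the longest common subsequence of two sequences.

    Classic dynamic programming, keeping only the previous row in memory.
    """
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_coverage(a, b):
    """Percentage of A's words that appear, in order, somewhere in B."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    if not words_a:
        return 0.0
    return 100.0 * lcs_length(words_a, words_b) / len(words_a)
```

Because a subsequence may skip elements, words inserted into B do not reduce the score; only words of A that are missing from B (or out of order) do. The running time is proportional to the product of the two word counts, so very long texts may need a coarser unit such as sentences.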


Begelfor

