Find substring in text which has the highest similarity to a given keyword

Question

Say I have this text = I love apples, kiwis, oranges and bananas and the searchString = kiwis and bananas and a similarity algorithm say Jaccard index. How can I efficiently find the substring in text which has the highest similarity to searchString.

Basically I am trying to find portions of text (text has high errors, misspellings, extra symbols and spaces) which match a list of keywords I have.

i dont know much about this but this link might help ... http://stackoverflow.com/questions/5859561/getting-the-closest-string-match — Dandy, Sep 16 '16 at 13:08
https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance — Dandy, Sep 16 '16 at 13:19
@Dandy: I know about edit distance. This question is asking given strings S and T, find a substring of S that has the smallest edit distance (or any other similarity metric) to T. — pathikrit, Sep 16 '16 at 13:29
@Dandy: Thanks for the link but I know the min edit distance problem. Not sure how it applied to my question? Can you provide an answer below detailing what you are trying to say? — pathikrit, Sep 18 '16 at 17:08

dveim · Answer 1 · 2016-09-16T13:28:20.280

5

Jaccard index is "lucky" similarity algorithm, because you can update it's value for new symbol without recalculating all previous stuff. So, you can view text as a sequence of diffs for resulting index value. After that, problem can be reduced to https://en.wikipedia.org/wiki/Maximum_subarray_problem.

What about your second paragraph, if you are doing some NLP-like research, I'd suggest to clean your data (remove those extra symbols and spaces, whenever that's possible) before further processing. That's known as "spelling correction", and there are tons of different algorithms and libraries. To choose appropriate one, extra information about your domain is needed.

edited Sep 16 '16 at 13:28

answered Sep 16 '16 at 13:11

dveim

3,381
2
21
31

> Jaccard index is "lucky" similarity algorithm, because you can update it's value for new symbol without recalculating all previous stuff. Can you explain or provide a link regarding what you mean above? – pathikrit Sep 16 '16 at 13:13
1

Assume sets are `A = {1, 2, 3}` and `B = {1}`, JI is `1/3`. If you add new value to `B`, you can simply update numerator and denominator -- `B += {2}`, JI becomes `(1 + 1) / (3 + 0) = 2/3`, values `1` and `0` can be calculated without rest of `B`. As long as we work with sets, `contains` operation is pretty efficient. (This example can be easily extended to strings) – dveim Sep 16 '16 at 13:21
(For previous comment -- to check if, for example, `1` already was calculated in numerator, we need additional check, that can be done in amortized constant time too (however, required max space is space of `A`). Actually interesting question -- what would be more efficient, to find diffs for `searchString` and check `contains` for `text`, or visa verse. I bet that doesn't matter, and going to write proof. – dveim Sep 16 '16 at 13:33

score 2 · Answer 2 · answered Sep 16 '16 at 13:11

2

Take a look at shingling technique, and try to find the similarity. you can follow this link: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

For example use 9 shingle and compare each subset with your specific keyword

answered Sep 16 '16 at 13:11

Masoud

1,343
8
25

I believe the algorithm I am using (Jacard index) is based on k-shingling anyway – pathikrit Sep 16 '16 at 13:30

score 1 · Answer 3 · edited Sep 27 '16 at 12:15

1

I Use Stemming and Levenshtein distance

This is the algorithm in action: https://wizsearch.wizsoft.com/index.php/demo/

This demo searches all wiki titles, try the "show search terms" option to see the Levenshtein distance and error correction algorithm in action.

edited Sep 27 '16 at 12:15

Bridge

29,818
9
60
82

answered Sep 23 '16 at 09:34

O_Z

1,515
9
11

score 0 · Answer 4 · answered Oct 01 '16 at 16:36

Every query term is checked against a dictionary. If a term is not found in the dictionary, then those words from the dictionary are shown as spelling suggestions, which are most similar to the query term in question.

Similarity / Edit distance As similarity measure between two words serves usually the Damerau-Levenshtein distance https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Few other references

Find substring in text which has the highest similarity to a given keyword

4 Answers4