How to find out if a string contains a substring or something similar to it

Question

There are two strings: str1 is a pattern and str2 is a long text.

str1 = 'how to do this weird task'
str2 = 'once upon a time...and smth long'

How can I find out whether str2 contains str1, or something similar to it that is not necessarily equal to str1?

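Currently I use Levenshtein.ratio with a window of length len(str1) sliding over str2:

from Levenshtein import ratio

# One row per window: [window text, pattern, similarity ratio].
# The +1 makes sure the last window of str2 is included.
res = [[str2[i:i+len(str1)], str1, ratio(str2[i:i+len(str1)], str1)] for i in range(len(str2) - len(str1) + 1)]

and then pick the row with the highest ratio, e.g. max(res, key=lambda r: r[2]). But maybe something better already exists?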
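The sliding-window idea described above can be wrapped in a small function. This is only a minimal sketch, not the asker's exact code: it swaps Levenshtein.ratio for the standard library's difflib.SequenceMatcher so it runs without third-party packages (the scores will differ somewhat from Levenshtein's), and the name best_window is hypothetical.

```python
from difflib import SequenceMatcher


def best_window(pattern, text):
    """Return (window, score) for the len(pattern)-sized slice of text
    that is most similar to pattern. Scores are floats in [0, 1]."""
    n = len(pattern)
    if len(text) <= n:
        # Text is no longer than the pattern: treat it as a single window.
        return text, SequenceMatcher(None, text, pattern).ratio()
    # Slide over every start offset, including the last full window (+1).
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    scored = ((w, SequenceMatcher(None, w, pattern).ratio()) for w in windows)
    # max() keeps the first window among those with the highest score.
    return max(scored, key=lambda pair: pair[1])


window, score = best_window('weird task', 'how to do this wierd task today')
print(window, round(score, 2))
```

A fixed character window only works well when the similar passage has roughly the same length as the pattern; longer or shorter paraphrases will be cut off or padded with neighbouring text.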

I think you should take a look at this question and other questions linked in its comments: [How to find a similar substring inside a large string with a similarity score in python?](https://stackoverflow.com/questions/48117508) — Jorge Luis, Mar 22 '23 at 09:20

angwrk · Accepted Answer · 2023-03-22T09:34:09.140

You can use tokenization, something like this:

from fuzzywuzzy import fuzz
import re

str1 = 'how to do this weird task'
str2 = 'Once upon a time, there was a person who wanted to know how to accomplish this weird task.'

str1 = str1.lower()
str2 = str2.lower()

# Tokenize the pattern and the text once, instead of at every character offset
pattern_words = re.findall(r'\w+', str1)
text_words = re.findall(r'\w+', str2)
n = len(pattern_words)

best_match = None
best_ratio = 0
# Slide a window of n words over the text, one word at a time
for i in range(len(text_words) - n + 1):
    window = text_words[i:i + n]
    # Compare the words position by position and average the scores
    ratios = [fuzz.ratio(w, window[j]) for j, w in enumerate(pattern_words)]
    avg_ratio = sum(ratios) / len(ratios)
    if avg_ratio > best_ratio:
        best_match = ' '.join(window)
        best_ratio = avg_ratio

threshold_ratio = 80
if best_ratio >= threshold_ratio:
    print(f"Found a match: '{best_match}' (ratio={best_ratio})")
else:
    print("No match found")

Output:

Found a match: 'how to accomplish this weird task' (ratio=86.16666666666667)
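The same word-window idea can also be packaged as a reusable function. What follows is a sketch under stated assumptions rather than the answer's exact code: difflib.SequenceMatcher from the standard library stands in for fuzz.ratio, so scores are floats between 0 and 1 instead of integers between 0 and 100, and the names word_ratio and find_similar_phrase as well as the default threshold of 0.8 are hypothetical.

```python
import re
from difflib import SequenceMatcher


def word_ratio(a, b):
    """Similarity of two words as a float in [0, 1] (stand-in for fuzz.ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_phrase(pattern, text, threshold=0.8):
    """Return (phrase, score) for the best-matching run of words in text,
    or None if no run reaches the threshold."""
    pattern_words = re.findall(r'\w+', pattern.lower())
    text_words = re.findall(r'\w+', text.lower())
    n = len(pattern_words)
    if n == 0 or len(text_words) < n:
        return None

    best_phrase, best_score = None, 0.0
    # Slide a window of n words over the text, one word at a time.
    for start in range(len(text_words) - n + 1):
        window = text_words[start:start + n]
        # Average the per-position word similarities.
        score = sum(word_ratio(p, w) for p, w in zip(pattern_words, window)) / n
        if score > best_score:
            best_phrase, best_score = ' '.join(window), score

    if best_score >= threshold:
        return best_phrase, best_score
    return None


result = find_similar_phrase(
    'how to do this weird task',
    'Once upon a time, there was a person who wanted to know '
    'how to accomplish this weird task.',
)
print(result)
```

Because words are compared position by position, this tolerates a replaced or misspelled word but not an inserted or dropped one, which shifts every following word out of alignment.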
