
I'm working on a project that requires me to check whether string1 is approximately present in string2. If it is (i.e. the match ratio exceeds some threshold, say delta), I need to extract the matched segment from string2 and save it.

string1 will be 100 to 200 characters long; string2 will be much longer, anywhere between 15000 and 20000 characters.

The example I am currently using:

string1 = "MA A NA E LA OO KA A SA A BHA I YA A BA A HA U MA A DA A DA A A NGA GA I KA AA RA A PA A DDA A DA A NA A NA TA A RA A BA MA A SA U DA EE GA AA JA A SA A BHA E GA E BA A NA DA I TA U"

string2 = string2

I've tried the fuzzywuzzy library and difflib's SequenceMatcher in Python, but with these I can only get the match ratio; I can't extract the matching substring from string2.

from fuzzywuzzy import fuzz
print(fuzz.partial_ratio(string1,string2))

After running fuzzywuzzy's partial_ratio check on the two strings, I get a ratio of 89.

I need to get an approximate substring of string2 that is almost the same length as string1. In other words, I need the location in string2 of the segment that produced the 89% match.
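To make the requirement concrete, here is a minimal sketch of one brute-force way to get both the ratio and the location, using only the standard library's difflib. It slides a window the length of string1 across string2 and keeps the best-scoring window. The function name `best_window` and the `step` parameter are made up for illustration, and the scores come from difflib's `ratio()`, so they will not necessarily agree with what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher


def best_window(needle, haystack, step=1):
    """Return (start, end, ratio) for the window of `haystack`
    (same length as `needle`) that is most similar to `needle`."""
    width = len(needle)
    # autojunk=False: otherwise difflib may treat very frequent
    # characters as junk when the second sequence has 200+ items.
    matcher = SequenceMatcher(None, autojunk=False)
    # SequenceMatcher caches details about the second sequence,
    # so the fixed string goes there and the window varies.
    matcher.set_seq2(needle)

    best_start, best_ratio = 0, 0.0
    last_start = max(len(haystack) - width, 0)
    for start in range(0, last_start + 1, step):
        matcher.set_seq1(haystack[start:start + width])
        # quick_ratio() is a cheap upper bound on ratio(); skip windows
        # that cannot beat the best score found so far.
        if matcher.quick_ratio() <= best_ratio:
            continue
        score = matcher.ratio()
        if score > best_ratio:
            best_start, best_ratio = start, score
    return best_start, best_start + width, best_ratio
```

With this, the check would look something like `start, end, score = best_window(string1, string2)` followed by `if score >= delta: segment = string2[start:end]`, where `delta` is a fraction between 0 and 1. Checking every offset is expensive for a 20000-character string2; a larger `step` trades accuracy for speed.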

droidmainiac
  • Maybe you can find [Longest common subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) and check with the ratio? – nice_dev Apr 24 '19 at 14:08
  • The `fuzzywuzzy` code is open-source. It seems trivial to extend [the `partial_ratio` function](https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py#L34) to output the substring corresponding to the best match (much like you'd get [the index of the max element in an array](https://stackoverflow.com/questions/11301438/return-index-of-greatest-value-in-an-array/11301464#11301464)). – Bernhard Barker Apr 24 '19 at 15:46
  • @vivek_23 That would take too much time: finding all subsequences of both strings and then comparing each and every one of them. – droidmainiac Apr 25 '19 at 05:36
  • @Abhijith Finding LCS does not work that way. See [here](https://www.geeksforgeeks.org/longest-common-subsequence-dp-4/) on the algorithm. – nice_dev Apr 25 '19 at 07:27
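Bernhard Barker's suggestion above could be sketched roughly as follows. This is not fuzzywuzzy's actual code: it is a rough reimplementation of the same idea on top of the standard library's difflib, on the assumption that `partial_ratio` aligns the shorter string at each matching block and scores the resulting slice of the longer string. The name `partial_match` is hypothetical, and the ratio is returned as a fraction in [0, 1] rather than a 0 to 100 integer.

```python
from difflib import SequenceMatcher


def partial_match(shorter, longer):
    """Return (ratio, start, substring) for the best-aligned slice of `longer`."""
    # autojunk=False: with the default, difflib may ignore very frequent
    # characters when the second sequence has 200+ items.
    blocks = SequenceMatcher(None, shorter, longer,
                             autojunk=False).get_matching_blocks()

    best_ratio, best_start = -1.0, 0
    for a, b, _size in blocks:
        # Align the start of `shorter` with this block's position in `longer`.
        start = max(b - a, 0)
        candidate = longer[start:start + len(shorter)]
        ratio = SequenceMatcher(None, shorter, candidate,
                                autojunk=False).ratio()
        # Track the index of the best score, not just the score itself.
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start

    return best_ratio, best_start, longer[best_start:best_start + len(shorter)]
```

Called as `ratio, start, segment = partial_match(string1, string2)`, this returns the candidate segment alongside its score, so the segment can be saved whenever `ratio` exceeds the threshold delta.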

0 Answers