1

Let's say I have the following list:

abc_Pickled_efg
abc_Pickller_efg
abcd_something_Picklesefgh
efg_Pickles_abc

And the regex pattern:

abc.*Pickles.*efg

None of these in the list match the pattern. But how would I go about finding what the closest match is (in this case abc_Pickled_efg). Is there some Python module or something that can say if something is a 92% match or 0% match or something?

My thinking was that I can go across the pattern and see if each symbol matches i.e.:

a  => a = 1 point
b  => b = 1 point
c  => c = 1 point
.* => _ = 1 point
P  => P = 1 point
i  => i = 1 point
c  => c = 1 point
k  => k = 1 point
l  => l = 1 point
e  => e = 1 point
r  => d = 0 point
.* => _ = 1 point
e  => e = 1 point
f  => f = 1 point
g  => g = 1 point
-----------------
         14 points out of 15 possible ~= 93% match

Does this make sense or is there something that does something like this?

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
TreeWater
  • 761
  • 6
  • 13
  • 1
    Does this answer your question? [Best Fuzzy Matching Algorithm?](https://stackoverflow.com/questions/491148/best-fuzzy-matching-algorithm) – Nir Alfasi Nov 06 '19 at 21:05
  • Could write a new regex engine that does this with special backtracking, etc.. Where set special options `Pickl?e?s?` –  Nov 06 '19 at 21:31
  • Why does `abc_Pickled_efg` present the closest match and not `abcd_something_Picklesefgh`? You're matching more in the second string than you are in the first. – ctwheels Nov 06 '19 at 22:10
  • @alfasin yes that looks like what I'm trying here – TreeWater Nov 07 '19 at 00:48
  • @ctwheels you are right. This is just an example I threw together, didn't really look all to closely but figured that I got my point across. Thanks – TreeWater Nov 07 '19 at 00:49

1 Answers1

1

How can this be accomplished in regex?

You can use the PyPi regex module that has a built-in fuzzy matching technique. This fuzziness technique allows you to specify what is and isn't allowed to happen on a string in terms of regex errors, how many are allowed, as well as a score for each type of fuzzy matching error. This information can be found here.

In terms of the regex, you would use the following (obviously if the regex you specified in your question is pre-determined, you can concatenate it; prepend (?: and append ){e}):

(?:abc.*Pickles.*efg){e}

In the regex above, {e} permits any type of error (deletion, insertion, substitution) when matching the string. You can modify this to your liking, some examples listed below (many copied from the link I provided in the first paragraph):

  • {d} deletion only
  • {d,i} deletion and insertion only
  • {d,s,i} deletion, substitution, or insertion (same as {e}) - order does not matter
  • {e<=3} permit at most 3 errors
  • {i<3} permit fewer than 3 insertions
  • {1<=e<3} permit at least 1, but fewer than 3 errors
  • {1<=2,d>2} permit at most 2 insertions, permit at least 2 deletions
  • {2i+d2+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
  • {i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

What does its implementation look like?

Now that you have an understanding of what this is and how it works, we can move on to the actual implementation. The code below makes use of the regex module's fuzzy_counts in the calculation to determine the best string based on its score. My calculation converts to percentage and works as follows (you can change it to your liking and, combined with the information above, change the scores for each type of error that the regex encounters):

# 100 * (1 - (fuzziness/length of original string))
100*(1 - (n/len(s)))

See this code working here

Note: I used pprint to nicely display the output; not necessary to use it, just makes the results easier to read.

import regex, pprint

strings = [
    "abc_Pickled_efg",
    "abc_Pickller_efg",
    "abcd_something_Picklesefgh",
    "efg_Pickles_abc"
]

r = r'(?:abc.*Pickles.*efg){e}'
matches = []

for i,s in enumerate(strings):
    x = regex.fullmatch(r,s, regex.BESTMATCH)   # match result
    n = sum(x.fuzzy_counts)             # fuzziness
    c = 100*(1 - (n/len(s)))            # closeness
    matches.append(c)

result = sorted(zip(strings, matches),key=lambda x: x[1],reverse=True)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)

And the result:

[   ('abcd_something_Picklesefgh', 96.15384615384616),
    ('abc_Pickled_efg', 93.33333333333333),
    ('abc_Pickller_efg', 87.5),
    ('efg_Pickles_abc', 60.0)]
ctwheels
  • 21,901
  • 9
  • 42
  • 77