How can this be accomplished in regex?
You can use the PyPi regex module that has a built-in fuzzy matching technique. This fuzziness technique allows you to specify what is and isn't allowed to happen on a string in terms of regex errors, how many are allowed, as well as a score for each type of fuzzy matching error. This information can be found here.
In terms of the regex, you would use the following (obviously if the regex you specified in your question is pre-determined, you can concatenate it; prepend (?:
and append ){e}
):
(?:abc.*Pickles.*efg){e}
In the regex above, {e}
permits any type of error (deletion, insertion, substitution) when matching the string. You can modify this to your liking, some examples listed below (many copied from the link I provided in the first paragraph):
{d}
deletion only
{d,i}
deletion and insertion only
{d,s,i}
deletion, substitution, or insertion (same as {e}
) - order does not matter
{e<=3}
permit at most 3 errors
{i<3}
permit fewer than 3 insertions
{1<=e<3}
permit at least 1, but fewer than 3 errors
{1<=2,d>2}
permit at most 2 insertions, permit at least 2 deletions
{2i+d2+1s<=4}
each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
{i<=1,d<=1,s<=1,2i+2d+1s<=4}
at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
What does its implementation look like?
Now that you have an understanding of what this is and how it works, we can move on to the actual implementation. The code below makes use of the regex module's fuzzy_counts
in the calculation to determine the best string based on its score. My calculation converts to percentage and works as follows (you can change it to your liking and, combined with the information above, change the scores for each type of error that the regex encounters):
# 100 * (1 - (fuzziness/length of original string))
100*(1 - (n/len(s)))
See this code working here
Note: I used pprint
to nicely display the output; not necessary to use it, just makes the results easier to read.
import regex, pprint
strings = [
"abc_Pickled_efg",
"abc_Pickller_efg",
"abcd_something_Picklesefgh",
"efg_Pickles_abc"
]
r = r'(?:abc.*Pickles.*efg){e}'
matches = []
for i,s in enumerate(strings):
x = regex.fullmatch(r,s, regex.BESTMATCH) # match result
n = sum(x.fuzzy_counts) # fuzziness
c = 100*(1 - (n/len(s))) # closeness
matches.append(c)
result = sorted(zip(strings, matches),key=lambda x: x[1],reverse=True)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)
And the result:
[ ('abcd_something_Picklesefgh', 96.15384615384616),
('abc_Pickled_efg', 93.33333333333333),
('abc_Pickller_efg', 87.5),
('efg_Pickles_abc', 60.0)]