8
bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
smallString = "I_AM_HERE"

Which efficient algorithm should I use to find a substring of the "bigString" that matches closely to the "smallString"

output = "I_AM_THERE"

The output may have few insertions and deletions when compared with small string.

Edit: Found a good example, very close to my problem here: How to add variable error to regex fuzzy search. Python

Community
  • 1
  • 1
user1140126
  • 2,621
  • 7
  • 29
  • 34
  • 2
    Define "matches closely". – jazzpi Nov 12 '13 at 21:12
  • 2
    Define "matches closely". – Scott Hunter Nov 12 '13 at 21:12
  • When you say `that matches closely ` do you mean exact match only or a fuzzy match? – dawg Nov 12 '13 at 21:12
  • 8
    @ScottHunter Well that's awkward... – jazzpi Nov 12 '13 at 21:12
  • 3
    @jazzpi: Great minds not only think alike, but at the same time... – Scott Hunter Nov 12 '13 at 21:21
  • Without knowing exactly what you mean by "matching closely", nobody can give you a real answer. However, Wikipedia has a nice article on [approximate string matching](http://en.wikipedia.org/wiki/Approximate_string_matching) that describes some of the choices, and links to some of the major algorithms (many of which have links to Python implementations, or at least to easy psuedocode that you can convert to Python). Hopefully that's enough to get you started. – abarnert Nov 12 '13 at 21:24
  • Can characters be deleted? Can characters be inserted? Can characters be swapped? Can characters merely be replaced? Maybe check out https://pypi.python.org/pypi/python-Levenshtein/ – dstromberg Nov 12 '13 at 21:28
  • @dawg, its not exact. There may be some insertions and deletions – user1140126 Nov 12 '13 at 21:36
  • @dstromberg, actually I was using "Levenshtein" package and the `ratio` and `distance` package is not suitable here – user1140126 Nov 12 '13 at 21:38
  • Take a look at the new [regex](https://pypi.python.org/pypi/regex) module – dawg Nov 12 '13 at 21:38
  • @dawg, thanks for pointing me to the module. I think it is helpful and can solve my problem – user1140126 Nov 12 '13 at 22:14

3 Answers3

8

You can use the almost-ready-to-be-everyones-regex package with fuzzy matching:

>>> import regex
>>> bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
>>> regex.search('(?:I_AM_HERE){e<=1}',bigString).group(0)
'I_AM_THERE'

Or:

>>> bigString = "AGAH_I_AM_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_NOWHERE_EREXXMHHGRFS"
>>> print(regex.findall('I_AM_(?:HERE){e<=3}',bigString))
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']

The new regex module will (hopefully) be part of Python3.4

If you have pip, just type pip install regex or pip3 install regex until Python 3.4 is out (with regex part of it...)


Answer to comment Is there a way to know the best out of the three in your second example? How to use BESTMATCH flag here?

Either use the best match flag (?b) to get the single best match:

print(regex.search(r'(?b)I_AM_(?:ERE){e<=3}', bigString).group(0))
# I_AM_THE

Or combine with difflib or take a levenshtein distance with a list of all acceptable matches to the first literal:

import regex

def levenshtein(s1,s2):
    if len(s1) > len(s2):
        s1,s2 = s2,s1
    distances = range(len(s1) + 1)
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1+1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

bigString = "AGAH_I_AM_NOWHERE_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_HERE_EREXXMHHGRFS"
cl=[(levenshtein(s,'I_AM_HERE'),s) for s in regex.findall('I_AM_(?:HERE){e<=3}',bigString)]

print(cl)
print([t[1] for t in sorted(cl, key=lambda t: t[0])])

print(regex.search(r'(?e)I_AM_(?:ERE){e<=3}', bigString).group(0))

Prints:

[(3, 'I_AM_NOWHERE'), (1, 'I_AM_THERE'), (0, 'I_AM_HERE')]
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
zelusp
  • 3,500
  • 3
  • 31
  • 65
dawg
  • 98,345
  • 23
  • 131
  • 206
  • That looks handy. Thanks for the preview. – beroe Nov 13 '13 at 04:50
  • @dawg, Thanks. Is there a way to know the best out of the three in your second example? How to use `BESTMATCH` flag here? – user1140126 Nov 13 '13 at 15:36
  • Hi @dawg - this is a great answer. I've edited to reflect how to use the *under-undocumented* BESTMATCH flag `(?b)` instead of the similar but different ENHANCEMATCH flag `(?e)`. Cheers! – zelusp Apr 24 '16 at 01:21
  • ... [Here's](http://stackoverflow.com/questions/36818463/how-can-i-find-the-best-fuzzy-string-match) a simple demo of `(?b)` in action. – zelusp Apr 24 '16 at 02:19
  • @dawg where is the documentation explaining the regex usage? The pypi page and homepage for `regex` does not give working examples of this `{e<=3}` type error/fuzzy syntax. Im trying to find all match candidates above a threshold percentage match. I want the matches and their percentages. So far examples I've found only provide the candidates above a threshold but not the percentages for each candidate – data_steve Aug 21 '20 at 14:48
  • The documentation for the fuzzy su=yntax is in the [PyPI documentation page](https://pypi.org/project/regex/) for the regex module. – dawg Aug 21 '20 at 14:51
0

Here is a bit of a hacky way to do it with difflib:

from difflib import *

window = len(smallString) + 1  # allow for longer matches
chunks = [bigString[i:i+window] for i in range(len(bigString)-window)]
get_close_matches(smallString,chunks,1)

Output:

['_I_AM_THERE']
beroe
  • 11,784
  • 5
  • 34
  • 79
0

Maybe the dynamic programming problem Longest Common Substring would be of some use here. Depending on your needs and matching criteria you could perhaps use Longest Common Subseuence

Paul
  • 7,155
  • 8
  • 41
  • 40