Python regex problems with string matching

Question

Sorry in advance for such a long post

EDIT--

Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern(exact match) is valid, then just print the results and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("\nThere is an exact match at: [{}:{}] for {}").format(*startIndices[0])
            return

    print('Used {} patterns:').format(len(patterns))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}').format(i, p)

    # list for all non-overlapping starting indices
    nonOverlapping = []
    # hold the last matches ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each matches starting index to the previous matches ending index
    # if the starting index > previous items ending index they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # startIndicex[start][0] gets the ending index from the current matches tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:').format(len(startIndices))
    # i is the key <starting index> assigned to the value of the indices (<ending index>, <string at those indices>
    for start in sorted(startIndices):
        # *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
        # for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
        print('- [{}:{}]  {}').format(*start)

    print()
    print('Found {} non-overlapping matches:').format(len(nonOverlapping))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}').format(*ov)

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6

http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

you can search for 'Board:isFull()' on anything to get the error stated below.

examples:

assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

--EDIT

The functions parameters work like so:

file is the location of a file.

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.

For the most part it works, But i'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

the above is from the re library

I've tried using re.escape() but it doesn't change anything.

Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'

Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.

Any help is appreciated.

I also can't use fuzzyfinder

Please fix your indentation and create a [mcve](https://stackoverflow.com/help/mcve) — styvane, Apr 10 '16 at 06:16
How do I fix it on here? when I press tab it doesn't do anything to the code — Darrel Holt, Apr 10 '16 at 06:20
Share your code on pastebin or ideone..just for correct indentation — rock321987, Apr 10 '16 at 06:22

score 1 · Accepted Answer · answered Apr 10 '16 at 15:55

I found some issues in the code:

re.escape() seems to not work because its result is not assigned.
Do origPattern = re.escape(origPattern).
When pattern is correctly escaped, be mindful of not breaking the escaping when manipulating the pattern.
Example: re.escape('Fun()') yields the string Fun\(\). The two \( substrings in it must never be separated: never remove, replace, or swap a \ without the char it escapes.
Bad manipulations: Fun(\) (removal), Fu\n(\) (swap), Fun\.{0,2}\).
Good manipulations: Fun\) (removal), Fu\(n\) (swap), Fun.{0,2}\).
You find too few matches because you only try to find fuzzy matches if there are no exact matches. (See line if indices.__len__() != 0:.) You must always look for them.
The loops inserting '.{0,2}' produce one too many pattern, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY which has two insertions.
The patterns with .{0,2} don't work as described; they allow one change and one insertion.
I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
Even though you use a set to store indices, matches from different regexes may still overlap.
You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
Not a bug, but: Why do you call magic methods explicitly?
(E.g. indices.__len__() != 0 instead of len(indices) != 0.)

I rewrote your code a bit to address any issues I saw:

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = {}

    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    nonOverlapping = []
    lastEnd = 0
    for start in sorted(startIndices):
        if start >= lastEnd:
            lastEnd = startIndices[start][0]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    for i in sorted(startIndices):
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for i in nonOverlapping:
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))


def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

Thanks a lot for that, I'm going to run it right now. The yield keyword you used for finding the matches means you're using generators in the for loops, right? I had no idea what the yield keyword was until you used it in this code so I had to go look it up. — Darrel Holt, Apr 11 '16 at 00:54
I just noticed it still doesn't get all 85, but it got 79 this time. I wonder if maybe the data I was given doesn't have 85 possible matches because I can't imagine this being any better than it already is. — Darrel Holt, Apr 11 '16 at 01:01
On the first for loop in doMatching() where it iterates through each pattern from generateFuzzyPatterns() how does it get every pattern that generateFuzzyPatterns() 'yields'? I thought the yield function was like a return function where it would break out of the function to return. How does it not yield the 'exact matches' pattern every time it's called with the aforementioned for loop? — Darrel Holt, Apr 11 '16 at 01:15
There is a problem with using a dictionary I just found out. lets say the origPattern passed is 'zenith'. it adds it when it goes over exact match, but when it matches with removals it updates the index that matched 'zenith' with 'zenth' for example. Thus throwing away the exact match. This is also why it comes up with 79 matches instead of 85, because it overwrites the previous matches. I think i'll have to make a dict of lists for it to add all possibilities from what I've read but I haven't figured it out yet. — Darrel Holt, Apr 11 '16 at 04:13
You're right. A dict sounds good, with tuples of (start, end) as keys. Also, verify that insertions work correctly: the code also inserts a wildcard before and after the word, which disagrees with your comment ("insert *between* adjacent chars"). Also, who says there's 8 matches? It's easy to manually verify that and then contrast with the code's result. — Norman, Apr 11 '16 at 23:22
Actually, I made it into a 3 part tuple to completely avoid any conflicts. It isn't as pretty as your original code but it works flawlessly. It returns every match. As for your question about the 8 matches, my professor said there was supposed to be 8 matches. But the code was actually more efficient than hers and found more matches than she had originally stated. It's also 40-400 times as fast as her horspool matching method. Would you by chance know if the re.findIter() function uses Thompson's construction? — Darrel Holt, Apr 12 '16 at 03:04
That it found *more* matches may be an error. Currently, the list of non-overlapping matches may contain duplicates because duplicate fuzzy patterns are generated for an origPattern like "see". -- If you assume higher efficiency because of quicker execution, then note that performance is inhomogeneous in python: certain functionality is implemented in C, esp. regex matching, and that may be 400x faster than when not done in C. — Norman, Apr 12 '16 at 06:31
Also note that not all matches of periodic patterns are found, e.g. "dada" in "dadada". About Thompson: Don't know but Python is open-source, so one could go and inspect it. — Norman, Apr 12 '16 at 07:02
It wasn't an error; She saw that I was correct and made the proper adjustments to the project requirements. She also only wanted overlapping matches, so there was no problem with duplicates for the non-overlapping matches. I actually looked at [this](https://hg.python.org/cpython/file/a82fc1f366b4/Modules/_sre.c#l1533). The function beginning on line 1533 makes me feel like it does use Thompson's because it mentions states, but I'm not very adept with higher complexity algorithms involving things like NFA and DFA because I don't take classes covering them until next semester. — Darrel Holt, Apr 12 '16 at 17:14

Python regex problems with string matching

1 Answers1