Implementing pattern matching - When is the cache being used?

Question

The task is to implement a simple pattern matching for the whole string where the pattern may include . and *.

The following code makes a DFS while maintaining two pointers for the text and the pattern. It also makes use of a cache for the obvious cause of not repeating work.

I was trying to test isMatch with several inputs but I don't see the cache is ever used ("from cache" line is never called). Can you help me show an example where the cache is being used?

Thanks

def isMatch(s, p):
    cache = {}

    def dfs(i, j):
        print(f"dfs was called with i={i}, j={j}")
        if (i, j) in cache:
            print((i, j), "from cache")
            return cache[(i, j)]
        
        if i >= len(s) and j >= len(p):
            return True
        if j >= len(p):
            return False

        match = i < len(s) and (s[i] == p[j] or p[j] == ".")
        if (j + 1) < len(p) and p[j+1] == "*":
            cache[(i, j)] = dfs(i, j+2) or (match and dfs(i + 1, j))
            return cache[(i, j)]

        if match:
            cache[(i, j)] = dfs(i + 1, j + 1)
            return cache[(i, j)]

        cache[(i, j)] = False
        return False

    return dfs(0, 0)


res = isMatch("aaaaaabbbbbb", "a*b*")
print(res)

UmNyobe · Answer 1 · 2021-09-12T18:33:39.213

edit: The cache is useful in this cases "ab", "a*a*"

Conditions :

The character a can both match the first and the second sub-expressions.
There is no match overall

The node (1,2) can be reached from two paths

First path explored = skip first sub-expression (0, 0) -> (0, 2) -> (1, 2)
Second path explored = match first sub-expression (0, 0) -> (1, 0) -> (1, 2)

The issue here is dfs(i, j+2) being false while there is a match between s[i] and p[j+2].

dfs(i, j+2) will trigger dfs(i, j+4), dfs(i, j+6), ..., dfs(i, j+2k)
dfs(i, j+4) will trigger dfs(i, j+6), ...., dfs(i, j+2k)
...
dfs(i, j+2k-2) will trigger dfs(i, j+2k)

You only need to cache dfs(i, j+2)

v = dfs(i, j+2)
cache[(i, j+2)] = v
return v or (match and dfs(i + 1, j))

wrong answer : This cache is never read, regardless of s and p

To have a cache hit you need a given tuple (x, y) with multiple paths leading from (0, 0) to (x, y). To reach (x, y) from multiple paths there need to be a branch.

So when do we branch ? Only place is

cache[(i, j)] = dfs(i, j+2) or (match and dfs(i + 1, j))

When the regular expression eventually match till a repeated pattern like a*
dfs(i, j+2) is false (if it is true, the whole expression is already true, and due to short circuit execution, the right part of the or is never run)
match is true : again, due to short circuit, if there is no match, the right part is not executed
dfs(i+1, j) is also false. (x, y) cannot be both true and false.

So The cache is theoretically hit when s does not match p.

We store false match only when we ran out of 0 or more expression like a* and the two digits don't match (eg isMatch('a', 'b')). At this point, the algorithm need to stop anyway, so there are no reason to cache anything.

When this happen, we are here :

        cache[(i, j)] = False
        return False

At this point, there are no more branch. Ie there is only one way to get there. So the cache is never read.

does that mean there's a bug in the author's algorithm or does this algorithm doesn't need a cache? Also, how do you explain the fact that after applying the cache mechanism, the runtime was reduced drastically to 44ms? — Elimination, Sep 11 '21 at 15:07

Implementing pattern matching - When is the cache being used?

1 Answers1