I have actually answered a nearly identical question before, which can be found here:
https://stackoverflow.com/a/55914487/2284490
The only difference is that you want to know len(matches)
instead of any(matches)
on the algorithm.
This can be solved efficiently as variations on the Aho Corasick algorithm
It is an efficient dictionary matching algorithm that locates patterns within text simultaneously in O(p + q + r)
, with p
= length of patterns, q
= length of text, r
= length of returned matches.
You may want to run two separate state machines simultaneously, and you would need to modify them so they terminate on the first match.
I took a stab at the modifications, starting with this python implementation
class AhoNode(object):
def __init__(self):
self.goto = {}
self.count = 0
self.fail = None
def aho_create_forest(patterns):
root = AhoNode()
for path in patterns:
node = root
for symbol in path:
node = node.goto.setdefault(symbol, AhoNode())
node.count += 1
return root
def aho_create_statemachine(patterns):
root = aho_create_forest(patterns)
queue = []
for node in root.goto.itervalues():
queue.append(node)
node.fail = root
while queue:
rnode = queue.pop(0)
for key, unode in rnode.goto.iteritems():
queue.append(unode)
fnode = rnode.fail
while fnode is not None and key not in fnode.goto:
fnode = fnode.fail
unode.fail = fnode.goto[key] if fnode else root
unode.count += unode.fail.count
return root
def aho_count_all(s, root):
total = 0
node = root
for i, c in enumerate(s):
while node is not None and c not in node.goto:
node = node.fail
if node is None:
node = root
continue
node = node.goto[c]
total += node.count
return total
def pattern_counter(patterns):
''' Returns an efficient counter function that takes a string
and returns the number of patterns matched within it
'''
machine = aho_create_statemachine(patterns)
def counter(text):
return aho_count_all(text, machine)
return counter
and to use it
patterns = ['CAT - cats are great', 'DOG - dogs are great too']
counter = pattern_counter(patterns)
text_list = ['CAT - cats are great(A)DOG - dogs are great too(B)',
'DOG - dogs are great too(B)']
for text in text_list:
print '%r - %s' % (text, counter(text))
This displays
'CAT - cats are great(A)DOG - dogs are great too(B)' - 2
'DOG - dogs are great too(B)' - 1
Note that this solution counts each match individually, so looking for 'a' and 'b' in 'aba' would give 3. If you wanted only one match per pattern, you would need to track all patterns seen, you would make a minor modification to convert the integer to a set:
- self.count = 0
+ self.seen = set()
...
- node.count += 1
+ node.seen.add(path)
...
- unode.count += unode.fail.count
+ unode.seen |= unode.fail.seen
...
- total = 0
+ all_seen = set()
- total += node.count
+ all_seen |= node.seen
- return total
+ return len(all_seen)