Problem: The program needs to read from an input stream of characters and match the input against a predefined list of regex patterns. It should report a match as soon as one is found (and report all of them if several patterns match), or report failure once the whole stream has been consumed without any match. Input may not arrive all at once, so whenever the program calls read on the input stream it may get zero, one, or many characters.
The trivial solution would be to buffer all the characters read so far and try to match that buffer against the whole group of regex patterns, but this is not very efficient. Regex matching is inherently stateful (it can be driven by an automaton), so there should be a way to do it without buffering the input. Ideally there is a one-pass solution using a technique similar to Aho-Corasick for plain string matching. I've searched for existing libraries but didn't find one capable of such incremental regex matching. Which libraries or algorithms could help (C, C++, Python or Perl libraries preferred)?
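To illustrate the one-pass direction I mean, here is a hand-rolled sketch, not a general regex engine: it only handles the toy patterns of the form "c+" used below, and the PlusDFA class and its feed method are names I made up for the example. Each pattern becomes a tiny state machine that consumes one character at a time, so no input has to be retained.

    # Hand-rolled sketch of one-pass matching for patterns of the form "c+".
    # Each machine is fed one character at a time and keeps only O(1) state;
    # a real solution would need a regex -> DFA compiler instead of PlusDFA.
    class PlusDFA:
        def __init__(self, char):
            self.char = char
            self.alive = True       # input so far can still match
            self.accepting = False  # input so far matches the pattern

        def feed(self, c):
            if self.alive and c == self.char:
                self.accepting = True
            else:
                self.alive = False
                self.accepting = False

    dfas = {c + '+': PlusDFA(c) for c in '0123'}
    for ch in '000':                # pretend these characters arrive incrementally
        for dfa in dfas.values():
            dfa.feed(ch)
    print([p for p, d in dfas.items() if d.accepting])   # ['0+']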
One existing solution forms a huge regex pattern by or-ing the small regex patterns together, but I'm not sure this is any faster than iterating over the small patterns one by one. And it is not an incremental match: every match attempt still runs over the whole buffered input.
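For reference, the or-ing approach can at least report which sub-pattern matched if each alternative is wrapped in a named group. A minimal sketch, assuming the sub-patterns themselves contain no conflicting group names (the p0, p1, ... names are my choice for illustration):

    import re

    regex_list = [r'0+', r'1+', r'2+', r'3+']
    # Wrap each sub-pattern in a named group (p0, p1, ...) so the match can
    # still report which of the original patterns fired.
    combined = re.compile('|'.join('(?P<p%d>%s)' % (i, p)
                                   for i, p in enumerate(regex_list)))

    m = combined.fullmatch('2222')      # still runs over the whole buffered input
    if m:
        print(m.lastgroup, m.group())   # p2 2222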
This is not a duplicate of "Incremental Pattern (RegEx) matching in Java?": the accepted answer there points in a direction but gives no concrete solution, and other users have reported a very large memory footprint from the merged FSA, which that answer doesn't explain how to optimize away to make the approach practical.
A Python skeleton to help formulate the problem:
    import sys

    regex_list = [r'0+', r'1+', r'2+', r'3+']

    class Matcher:
        def match(self):
            ...
            self.buffer += sys.stdin.read()
            ...
            return ([matched_regex, ...], matched_token)

    matcher = Matcher()
    while True:
        matched_regex_list, token = matcher.match()
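To make the intended calling convention unambiguous, here is a minimal, hedged completion of the skeleton using the buffered baseline described above (i.e. exactly the inefficient approach the question wants to avoid); the chunk size, the eof flag and the use of fullmatch are my assumptions, not part of the problem:

    import re
    import sys

    regex_list = [r'0+', r'1+', r'2+', r'3+']

    class Matcher:
        """Buffered baseline: re-scans everything read so far on every call."""
        def __init__(self):
            self.buffer = ''
            self.patterns = [re.compile(p) for p in regex_list]
            self.eof = False

        def match(self):
            chunk = sys.stdin.read(1024)   # may return '', one, or many chars
            if chunk == '':
                self.eof = True
            self.buffer += chunk
            matched = [p.pattern for p in self.patterns
                       if p.fullmatch(self.buffer)]
            return (matched, self.buffer)

    matcher = Matcher()
    while True:
        matched_regex_list, token = matcher.match()
        if matched_regex_list:
            print('matched:', matched_regex_list, repr(token))
            break
        if matcher.eof:
            print('no match')
            break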