1

Problem: The program needs to read from an input stream of characters, and match the input against a predefined list of regex patterns. The program should report a match as soon as it is found (report all if there are multiple), or report failure after the whole stream is consumed without any match. Input may not come all at once, so whenever the program calls read on the input stream, it's possible to read zero, one or many characters.

The trivial solution would be buffering all the characters that have been read so far, and try to match it against the whole group of regex patterns. But this is not very efficient. Regex matching is inherently stateful, so there should be a method without buffering the input. Ideally there is a one-pass solution using a technique similar to the Aho-Corasick for plain string matching. I've searched for existing libraries but didn't find one capable for such incremental regex matching. Which libraries or algorithms could help (C, C++, Python or Perl libraries preferred)?

One existing solution forms a huge regex pattern by or-ing small regex patterns. But I'm not sure if this is faster than iterating the small regex patterns. And this is not incremental match: Every match requires the whole buffered input.

Not a duplicate of this: Incremental Pattern (RegEx) matching in Java? The accepted answer there shows a direction but no concrete solution. Other users have reported very large memory footprint resulting from the merged FSA, but that answer doesn't mention how to optimize to make it practical.


A Python skeleton to help formulate the problem:

regex_list = [ r'0+', r'1+', r'2+', r'3+' ]
class Matcher:
    def match(self):
        ...
        buffer += sys.stdin.read()
        ...
        return ([ matched_regex, ... ], matched_token)
matcher = Matcher()
while True:
    matched_regex_list, token = matcher.match()
Cyker
  • 9,946
  • 8
  • 65
  • 93
  • Check Java's Scanner source for inspiration. It's doing exactly what you are trying to do here. Well, you would need to read source code of Pattern as well, since it keeps track of whether something can be considered a match without seeing what comes next. – nhahtdh Jan 31 '19 at 06:01
  • @nhahtdh OK will do. Methods like `nextDouble` could possibly help, but I'm wondering how difficult it is to port that to a different language. Furthermore, here is a difference: this program also needs to report *which* regex is matched (if any). – Cyker Jan 31 '19 at 06:06
  • I happen to come accross the very same problem only a couple of months later. Any update on this? – iago-lito Apr 07 '19 at 09:18

0 Answers0