4

Is there a way or an efficient library that allows for incremental regular expression matching in Java?

What I mean by that is, I would like to have an OutputStream that I can send a couple bytes at a time to and that keeps track of matching the data so far against a regular expression. If a byte is received that will cause this regular expression to definitely not match, I would like the stream to tell me so. Otherwise it should keep me informed about the current best match, if any.

I realize that this is likely to be an extremely difficult and not well defined problem, since one can imagine regular expressions that can match a whole expression or any part of it or not have a decision until the stream is closed anyways. Even something as trivial as .* can match H, He, Hel, Hell, Hello, and so forth. In such a case, I would like the stream to say: "Yes, this expression could match if it was over now, and here are the groups it would return."

But if Pattern internally steps through the string it matches character by character, it might not be so hard?

tshepang
  • 12,111
  • 21
  • 91
  • 136
Markus A.
  • 12,349
  • 8
  • 52
  • 116
  • Actually, backtracking is the norm in regexp evaluation. Your intuition that this would be ill-defined is absolutely spot-on. – Marko Topolnik Oct 09 '12 at 16:52
  • @MarkoTopolnik I guess you could use backtracking and still process the characters in order, though, no? Or do regex engines jump around in the string to do "random" look-aheads? – Markus A. Oct 09 '12 at 17:00
  • A lookahead may require the entire input sequence to be examined, without actually matching anything. – Marko Topolnik Oct 09 '12 at 17:20
  • I think Scanner class in Java does something similar. It uses Pattern/Matcher class to process input. Maybe you can look at the source code to see how it uses Matcher class? (Matcher class has hitEnd and requireEnd method, which seems to cater to such usage). – nhahtdh Jan 11 '14 at 17:28
  • Related to (or maybe even a duplicate of) https://stackoverflow.com/q/3013669 – Marcono1234 Jun 27 '21 at 15:50

1 Answers1

1

Incremental matching can be nicely achieved by computing the finite state automaton corresponding to a regular expression, and performing state transitions on that while processing the characters of the input. Most lexers work this way. This approach won't work well for groups, though.

So perhaps you could make this two parts: have one matcher which figures out whether there is any match at all, or any chance of a match in the future. You can use that to give you a quick reply after every input character. Once you have a complete match, you can exucte a backtracking and grouping regular expression engine to identify your matching groups. In some cases, it might be feasible to encode the grouping stuff into the automaton as well, but I can't think of a generic way to accomplish this.

MvG
  • 57,380
  • 22
  • 148
  • 276
  • The FSM models only the basic subset of a modern regexp language. – Marko Topolnik Oct 09 '12 at 17:21
  • This should work great in my case, will just be a bit of work. I was hoping there was some available function stashed away somewhere. But I guess it never hurts to implement something yourself to make sure you fully understand it. – Markus A. Oct 14 '12 at 03:22