0

This is a follow-up to my previous question.

Suppose I want to generate all strings that match a given (simplified) regular expression.

It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are generated actually). So the main requirement is to produce nice, clean, and simple code.

I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?

Community
  • 1
  • 1
Michael
  • 10,185
  • 12
  • 59
  • 110
  • The moment you put a `*` in, you would produce infinite (and terribly boring) output. – Dave Jan 04 '12 at 16:13
  • Ok, let's assume it will be a very restricted grammar (`a`, `b`, `sequence`, `alternate`, `zero or more`) – Michael Jan 04 '12 at 16:17
  • You could still use `*` in the regex, you'd just have to have a more complex algorithm that wouldn't get stuck on repeatedly adding characters. For example if you had `[a-z]*[ab]` you could generate `a b aa ab ba bb ca cb ... za zb aaa aab...` – Dan Simon Jan 04 '12 at 16:42

1 Answers1

4

The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.

First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,

List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"),List(Constant("ij"),ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)

Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.

Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or howevermany unicode code points there are?), and you may wish to restrict those also.

Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • Thank you. Would you use a `stream`, `iterator` or none of them to implement the "word generator"? – Michael Jan 04 '12 at 16:52
  • I'd probably use none of them, but it could be easily wrapped in an `Iterator` or a `Stream`. – Rex Kerr Jan 04 '12 at 19:04
  • If he treats the problem as a graph generation, a breadth first approach can generate an infinite sequence without limiting number of repetitions. – Daniel C. Sobral Jan 04 '12 at 21:54
  • @DanielC.Sobral - True enough, though I was considering the arbitrarily complex case of nested groups where, while you can come up with a diagonalization strategy, you don't necessarily produce the interesting part of the regex in much depth before being overwhelmed by the boring stuff. In the simple case, I agree--this is likely a good way to go. – Rex Kerr Jan 04 '12 at 22:48