7

Typically in our work we use regular expressions in capture or match operations.

However, a regular expression can also be used, at least by hand, to generate legal sentences that match it. Of course, some regular expressions match infinitely many sentences of unbounded length, e.g., the expression .+.

I have a problem that could be solved by using a regular expression sentence generating algorithm.

In pseudocode, it would operate something like this:

re = generate("foo(bar|baz)?", max_match = 100);  #Don't give me more than 100 results
assert re == ("foobar", "foobaz", "foo");

What algorithm would perform this for me?

Paul Nathan
  • I know how to do this easily with a given search string and a given pattern. Is that good enough? If so, tell me and I’ll show you. You are very smart to give it an upper bound, too. I can do that. But there are infinitely many strings otherwise, so I don’t know how you would do that, although Bart Miller’s “fuzz testing” might perhaps apply, wherein he generates random input to feed programs to see whether that causes them to fail spectacularly. – tchrist Nov 17 '10 at 20:45
  • @tchrist: I can generate random garbage quite nicely. I would like to do something just like the above example shows. My rummaging shows that the Perl module String::Random is very like Xeger, but doesn't support (|). Xeger itself just walks the automata that the regex describes. That appears to be a common case. I read that Haskell has a regexp module that generates, I'm digging on that atm. – Paul Nathan Nov 17 '10 at 20:52
  • Couldn't find the haskell regexp module. :-/ – Paul Nathan Nov 17 '10 at 21:26

2 Answers

2

Microsoft has a gratis (MSRL-licensed), SMT-based tool called "Rex" for this: http://research.microsoft.com/en-us/downloads/7f1d87be-f6d9-495d-a699-f12599cea030/

From the Introduction section of the "Rex: Symbolic Regular Expression Explorer" paper:

We translate (extended) regular expressions or regexes [5] into a symbolic representation of finite automata called SFAs. In an SFA, moves are labeled by formulas representing sets of characters rather than individual characters. An SFA A is translated into a set of (recursive) axioms that describe the acceptance condition for the strings accepted by A and build on the representation of strings as lists.

As the SMT solver can output all possible solutions within some size bound, this may be close to what you're looking for.
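To make "all possible solutions within some size bound" concrete, here is a minimal Python sketch. It is not Rex and uses no SMT solver; it swaps in a plain recursive enumeration over a hand-parsed pattern. The names `parse`, `words`, and `generate`, and the `max_len` parameter, are assumptions made for illustration, and only a tiny regex subset is supported (literal characters, grouping, `|`, and postfix `?`, `*`, `+`).

```python
def parse(pattern):
    """Parse a tiny regex subset: literals, (...), |, and postfix ? * +."""
    pos = 0

    def alternation():
        nonlocal pos
        branches = [concatenation()]
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            branches.append(concatenation())
        return ("alt", branches)

    def concatenation():
        items = []
        while pos < len(pattern) and pattern[pos] not in "|)":
            items.append(repetition())
        return ("cat", items)

    def repetition():
        nonlocal pos
        node = atom()
        while pos < len(pattern) and pattern[pos] in "?*+":
            node = (pattern[pos], node)
            pos += 1
        return node

    def atom():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == "(":
            node = alternation()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("unbalanced parenthesis")
            pos += 1
            return node
        return ("lit", ch)

    tree = alternation()
    if pos != len(pattern):
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def words(node, max_len):
    """Return the set of strings of length <= max_len that the node matches."""
    kind = node[0]
    if kind == "lit":
        return {node[1]} if max_len >= 1 else set()
    if kind == "alt":
        return set().union(*(words(branch, max_len) for branch in node[1]))
    if kind == "cat":
        result = {""}
        for item in node[1]:
            part = words(item, max_len)
            result = {a + b for a in result for b in part
                      if len(a + b) <= max_len}
        return result
    inner = words(node[1], max_len)
    if kind == "?":
        return inner | {""}
    # "*" and "+": keep appending repetitions until nothing new fits.
    result = {""} if kind == "*" else set(inner)
    frontier = set(result)
    while frontier:
        frontier = {a + b for a in frontier for b in inner
                    if b and len(a + b) <= max_len} - result
        result |= frontier
    return result


def generate(pattern, max_match=100, max_len=20):
    """Return up to max_match matching strings, shortest first."""
    found = sorted(words(parse(pattern), max_len), key=lambda s: (len(s), s))
    return found[:max_match]


print(generate("foo(bar|baz)?"))
```

For a pattern that describes a finite language, such as the one in the question, choosing `max_len` at least as long as the longest word yields every word exactly once. The sketch builds the whole set before truncating to `max_match`, so it is only suitable for small patterns and bounds.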

On a more statistical and less formal front, the Regexp::Genex module from CPAN can work as well: http://search.cpan.org/dist/Regexp-Genex/

You can use it with something like this:

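#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Genex ':all';

my $hits = 100;                 # maximum number of strings to print
my $re   = qr/[a-z](123|456)/;
# Start from the length of the stringified pattern and grow from there.
local $Regexp::Genex::DEFAULT_LEN = length $re;
my %seen;
# Collect distinct generated strings for about two seconds.
while ((time - $^T) < 2) {
    @seen{strings($re)} = ();
    $Regexp::Genex::DEFAULT_LEN++;
}
# Sort the keys only; sorting %seen itself would mix in the undef values.
my @found = sort keys %seen;
$#found = $hits - 1 if @found > $hits;
print "$_\n" for @found;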

Adjust the time and sample size as needed. Hope this helps!

audreyt
  • I've just implemented another "output all possible solutions" tool at https://github.com/audreyt/regex-genex with the yices2 SMT solver. Might be useful as well. :-) – audreyt May 23 '11 at 15:41
  • The `Rex` project's research page seems to have moved to here: https://www.microsoft.com/en-us/research/project/rex-regular-expression-exploration/ – tykom Dec 02 '21 at 20:40
1

Take a look at Xeger (Google Code).

The Visual Studio Team System appears to have an inverse regex generator, too, but it doesn't look like the algorithm is open source.

Tim Pietzcker
  • Hmm. That's a random generator. I would like *some* sort of confidence that, for a finite language that the regex describes, I can enumerate all words in the language (yes, I could generate some sort of statistical measure, which *might* work, but I'm not good enough at advanced stats to be confident in my confidence interval...). – Paul Nathan Nov 17 '10 at 20:31