How to generate random strings that match a given regexp?

Question

Duplicate:

Random string that matches a regexp

No, it isn't. I'm looking for an easy and universal method, one that I could actually implement. That's far more difficult than randomly generating passwords.

I want to create an application that takes a regular expression and shows 10 randomly generated strings that match that expression. It's supposed to help people better understand their regexps and to decide, e.g., whether they're secure enough for validation purposes. Does anyone know of an easy way to do that?

One obvious solution would be to write (or steal) a regexp parser, but that seems really over my head.

I repeat, I'm looking for an easy and universal way to do that.

Edit: A brute-force approach is out of the question. Assuming the random strings were just [a-z0-9]{10} and 1 million iterations per second, it would take about 116 years to iterate through the space of all 10-character strings (36^10, roughly 3.7 × 10^15 of them).

I don't think there's going to be an easy way to do this... maybe the mechanical turk? :) — Greg, Apr 14 '09 at 16:12
Do you have a particular regex in mind, or are you after a general solution for any regex variant? Because you're not going to find one that works for Perl as well as .NET unless you restrict yourself to truly regular expressions without any extensions. — Welbog, Apr 14 '09 at 16:24
Well, I would _like_ a general solution for a single variant, most notably the one I use, Perl Regular Expressions implementation in PHP. — Michał Tatarynowicz, Apr 14 '09 at 17:23
In general, the problem is #P-hard. https://www.researchgate.net/publication/220780342_Counting_and_Random_Generation_of_Strings_in_Regular_Languages — , Dec 13 '15 at 18:49
See also [Given a regular expression, how would I generate all strings that match it?](https://stackoverflow.com/questions/20080789/given-a-regular-expression-how-would-i-generate-all-strings-that-match-it) — Sjoerd, Apr 12 '17 at 18:58

score 24 · Accepted Answer · edited Apr 14 '09 at 16:07

Parse your regular expression into a DFA, then traverse your DFA randomly until you end up in an accepting state, outputting a character for each transition. Each walk will yield a new string that matches the expression.

This doesn't work for "regular" expressions that aren't really regular, though, such as expressions with backreferences. It depends on what kind of expression you're after.
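A minimal sketch of the random walk, assuming the DFA has already been built. The transition table is written by hand for the expression a(b|c)*d+ purely for illustration; the names `dfa` and `randomWalk` and the `stopChance` parameter are hypothetical, and no regex parsing is shown.

```javascript
// Hand-built DFA for the expression a(b|c)*d+ (illustrative only).
// Every state here can still reach an accepting state, so the walk cannot get stuck.
const dfa = {
  start: 0,
  accepting: new Set([2]),
  transitions: {
    0: { a: 1 },
    1: { b: 1, c: 1, d: 2 },
    2: { d: 2 },
  },
};

// Walk the DFA at random, emitting one character per transition.
// In an accepting state, stop with probability stopChance (or when no
// transition is left). rng is injectable so runs can be made repeatable.
function randomWalk(dfa, stopChance = 0.5, rng = Math.random) {
  let state = dfa.start;
  let out = "";
  for (;;) {
    const edges = Object.entries(dfa.transitions[state] || {});
    const accepting = dfa.accepting.has(state);
    if (accepting && (edges.length === 0 || rng() < stopChance)) {
      return out;
    }
    if (edges.length === 0) {
      throw new Error("dead state " + state + ": prune such states first");
    }
    const [ch, next] = edges[Math.floor(rng() * edges.length)];
    out += ch;
    state = next;
  }
}

const samples = Array.from({ length: 10 }, () => randomWalk(dfa));
console.log(samples);
```

Setting `stopChance` to 1 stops at the first accepting state reached, which is the rule described in the answer; lower values let the walk continue through accepting states and produce longer strings.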

edited Apr 14 '09 at 16:07

Richard Ev

answered Apr 14 '09 at 16:00

Welbog

@Richard E: Deterministic finite automaton – Brian Apr 14 '09 at 16:03
@Richard E: Deterministic Finite Automata: http://en.wikipedia.org/wiki/Deterministic_finite_state_machine Basically it's the implementation of a regular expression. When you compile a regex, a DFA is the result. – Welbog Apr 14 '09 at 16:03
@Richard E., deterministic finite automata? – Rob Wells Apr 14 '09 at 16:03
Given an arbitrary regexp, the resulting DFA can be very large, so simply traversing the DFA randomly can end up in a loop... – dfa Apr 14 '09 at 16:10
This doesn't sound easy at all :) – Michał Tatarynowicz Apr 14 '09 at 16:12
@DFA: If you end up in a non-accepting branch of the DFA from which no transitions end in accepting states, then you'll have to start over. Obviously if such a branch exists it would have to be trimmed out of the set of states somehow. It should be simple enough to use graph algorithms to find them. – Welbog Apr 14 '09 at 16:13
@Pies: This is how regular expressions work. Even if you find a library that does it for you, this is probably how it works. It does exactly what you need of it: traverse the structure the regex represents, but in reverse; producing a string rather than consuming one. – Welbog Apr 14 '09 at 16:15
@welbog: absolutely, I was only warning about this problem. Your answer is The Universal One :))) – dfa Apr 14 '09 at 16:15
Note that it is also possible to end up in a valid, infinite loop. A classic example would be looking for a match for .* . It will start by generating a, then aa, then aaa, etc. Obviously you'll want to try to generate the shortest strings first to avoid this kind of thing. – Brian Apr 14 '09 at 18:49
@Brian: this is why I suggest stopping once you reach an accepting state, and trimming branches that can't reach accepting states. – Welbog Apr 14 '09 at 19:22
My idea would be to simply treat .* as a random number of random characters. There would have to be some limits of course (like * generating at most 1000 chars). – Michał Tatarynowicz May 04 '09 at 12:47
@Brian: or you could generate a NFA instead of a DFA, with (x) probability of repeating at each * node, and (1-x) probability of continuing through the regex. – Dominic Scheirlinck May 14 '09 at 05:14
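The trimming step mentioned in the comments above (removing branches that cannot reach an accepting state) can be sketched as a backward reachability search. This is only an illustration: the DFA shape `{ start, accepting, transitions }` and the function names `liveStates` and `prune` are assumptions made for the example, not part of any library.

```javascript
// Return the set of states from which some accepting state is reachable.
// Assumed DFA shape (hypothetical):
// { start, accepting: Set, transitions: { state: { char: nextState } } }
function liveStates(dfa) {
  // Build reverse edges: target state -> list of source states.
  const reverse = new Map();
  for (const [from, edges] of Object.entries(dfa.transitions)) {
    for (const to of Object.values(edges)) {
      if (!reverse.has(to)) reverse.set(to, []);
      reverse.get(to).push(Number(from));
    }
  }
  // Breadth-first search backwards from the accepting states.
  const live = new Set(dfa.accepting);
  const queue = [...dfa.accepting];
  while (queue.length > 0) {
    const state = queue.shift();
    for (const prev of reverse.get(state) || []) {
      if (!live.has(prev)) {
        live.add(prev);
        queue.push(prev);
      }
    }
  }
  return live;
}

// Keep only live states and drop every transition into a dead state.
function prune(dfa) {
  const live = liveStates(dfa);
  const transitions = {};
  for (const state of live) {
    transitions[state] = {};
    for (const [ch, to] of Object.entries(dfa.transitions[state] || {})) {
      if (live.has(to)) transitions[state][ch] = to;
    }
  }
  return { start: dfa.start, accepting: dfa.accepting, transitions };
}

// Example: state 3 is a trap that can never reach the accepting state 2.
const example = {
  start: 0,
  accepting: new Set([2]),
  transitions: { 0: { a: 1, x: 3 }, 1: { b: 2 }, 3: { x: 3 } },
};
const pruned = prune(example);
console.log(pruned.transitions);
```

If the start state itself is not in the live set, the expression matches nothing and there is no string to generate. After pruning, a random walk can no longer wander into a branch it would have to abandon.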

score 7 · Answer 2 · answered Apr 14 '09 at 16:04

Take a look at Perl's String::Random.

answered Apr 14 '09 at 16:04

moonshadow

I don't suppose you know a similar thing for PHP? – Michał Tatarynowicz Apr 14 '09 at 16:13
Write it in Perl, compile it with some Perl-to-executable tool, then invoke it from PHP. – Thomas L Holaday Apr 14 '09 at 16:14
The internet is a series of tubes. – Thomas L Holaday Apr 14 '09 at 16:24
Yeah, I guess it's just easier to deploy if you use a single language :) – Michał Tatarynowicz Apr 14 '09 at 16:42
Perl's String::Random only supports a small subset of regexp, so I'll have to look for something better. – Michał Tatarynowicz Apr 14 '09 at 17:14
Try [this PHP library](https://github.com/icomefromthenet/ReverseRegex). Seems to support more regex formats than String::Random. – Tamlyn Apr 12 '13 at 10:26
And another [PHP lib](https://github.com/sam-at-github/language-generator/blob/master/README.md) – spinkus Feb 07 '16 at 04:15

score 0 · Answer 3 · answered Apr 14 '09 at 18:53

One rather ugly solution that may or may not be practical is to leverage an existing regex diagnostics option. Some regex libraries can report where the regex failed to match. In that case, you could use what is in effect a form of brute force, but adding one character at a time and trying to build longer (and further-matching) strings until you get a full match. This is a very ugly solution. However, unlike standard brute force, a failure on a string like ab will also tell you whether there exists a string ab.* that will match (if not, stop and try ac; if so, try a longer string). This is probably not feasible with all regex libraries.

On the bright side, this kind of solution is probably pretty cool from a teaching perspective. In practice it's probably similar in effect to a DFA solution, but without the requirement to think about DFAs.

Note that you won't want to use random strings with this technique. However, you can use random characters to start with if you keep track of what you've tested in a tree, so the effect is the same.
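A rough sketch of this prefix-growing search, under one loud assumption: JavaScript's built-in RegExp has no diagnostic that says whether a failed string could still be extended into a match, so the `isViablePrefix` callback below is a hypothetical stand-in for that engine feature. In the demo it is written by hand for the expression ab+c. The names `growMatch` and `shuffled` are also invented for the example.

```javascript
// Fisher-Yates shuffle on a copy, so each level tries characters in a random order.
function shuffled(items, rng) {
  const a = items.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Depth-first search over prefixes. The recursion is the "tree of what has
// been tested": each prefix is visited at most once per search.
// isViablePrefix is a hypothetical oracle standing in for the engine's
// "could a longer string still match?" diagnostic.
function growMatch(isFullMatch, isViablePrefix, alphabet, maxLen, rng = Math.random, prefix = "") {
  if (isFullMatch(prefix)) return prefix;
  if (prefix.length >= maxLen) return null;
  for (const ch of shuffled(alphabet, rng)) {
    const next = prefix + ch;
    if (!isViablePrefix(next)) continue; // the whole subtree is ruled out at once
    const found = growMatch(isFullMatch, isViablePrefix, alphabet, maxLen, rng, next);
    if (found !== null) return found;
  }
  return null;
}

// Demo for ab+c. The prefix test is hand-written here; a real engine
// would have to supply it.
const isFull = (s) => /^ab+c$/.test(s);
const isPrefix = (s) => /^(a(b+c?)?)?$/.test(s);
const found = growMatch(isFull, isPrefix, ["a", "b", "c"], 6);
console.log(found);
```

The gain over plain brute force comes from the `continue` line: one rejected prefix discards every string that starts with it, so the search never enumerates them.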

Interesting idea, I'll check it out. – Michał Tatarynowicz Apr 14 '09 at 23:01

score -1 · Answer 4 · answered Apr 14 '09 at 16:03

If your only criteria are that your method is easy and universal, then there ain't nothing easier or more universal than brute force. :)

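// Brute force: keep drawing random strings until ten of them match.
// generateRandomString() and myRegex are assumed to be defined elsewhere.
var myListOfGoodStrings = [];
for (var i = 0; i < 10; ++i) {
    var str;
    do {
        str = generateRandomString();
    } while (!myRegex.test(str)); // RegExp objects provide test(), not match()
    myListOfGoodStrings.push(str);
}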

Of course, this is a very silly way to do things and mostly was meant as a joke.

I think your best bet would be to try writing your own very basic parser, teaching it just the things you're expecting to encounter (e.g. letter and number ranges, repeating/optional characters; don't worry about look-behinds etc.).
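To give a sense of how small such a parser can be, here is a sketch that handles only literal characters, classes such as [a-z0-9], and the quantifiers ?, *, +, {n} and {n,m}. Groups, alternation, negated classes, escapes, open-ended {n,} and look-arounds are deliberately left out. All names (`parseSimpleRegex`, `expandClass`, `generate`) and the `maxRepeat` cap on * and + are assumptions made for the example.

```javascript
// Parse a small regex subset into a list of { chars, min, max } items.
// Supported: literal characters, classes such as [a-z0-9], and the
// quantifiers ?, *, +, {n} and {n,m}. Everything else is out of scope.
function parseSimpleRegex(pattern, maxRepeat = 8) {
  const items = [];
  let i = 0;
  while (i < pattern.length) {
    let chars;
    if (pattern[i] === "[") {
      const end = pattern.indexOf("]", i);
      if (end === -1) throw new Error("unterminated character class");
      chars = expandClass(pattern.slice(i + 1, end));
      i = end + 1;
    } else {
      chars = [pattern[i]];
      i += 1;
    }
    // Optional quantifier following the item; * and + are capped at maxRepeat.
    let min = 1;
    let max = 1;
    const braces = /^\{(\d+)(?:,(\d+))?\}/.exec(pattern.slice(i));
    if (braces) {
      min = Number(braces[1]);
      max = braces[2] === undefined ? min : Number(braces[2]);
      i += braces[0].length;
    } else if (pattern[i] === "?") { min = 0; max = 1; i += 1; }
    else if (pattern[i] === "*") { min = 0; max = maxRepeat; i += 1; }
    else if (pattern[i] === "+") { min = 1; max = maxRepeat; i += 1; }
    items.push({ chars, min, max });
  }
  return items;
}

// Expand the body of a class such as "a-z0-9" into its characters.
function expandClass(body) {
  const chars = [];
  for (let i = 0; i < body.length; i++) {
    if (body[i + 1] === "-" && i + 2 < body.length) {
      for (let c = body.charCodeAt(i); c <= body.charCodeAt(i + 2); c++) {
        chars.push(String.fromCharCode(c));
      }
      i += 2;
    } else {
      chars.push(body[i]);
    }
  }
  return chars;
}

// Pick a repeat count in [min, max], then pick that many characters.
function generate(items, rng = Math.random) {
  let out = "";
  for (const { chars, min, max } of items) {
    const count = min + Math.floor(rng() * (max - min + 1));
    for (let k = 0; k < count; k++) {
      out += chars[Math.floor(rng() * chars.length)];
    }
  }
  return out;
}

const pattern = "[A-Z][a-z]+-?[0-9]{2,4}";
const items = parseSimpleRegex(pattern);
const samples = Array.from({ length: 10 }, () => generate(items));
console.log(samples);
```

Because the subset has no alternation or nesting, the parsed form is a flat list and generation is a single pass, which keeps the running time well under a second for any reasonable pattern.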

answered Apr 14 '09 at 16:03

nickf

I want an algorithm that will actually finish running before the end of time. I want it to run below 1 second, for sure. – Michał Tatarynowicz Apr 14 '09 at 16:17
heh well you should be more specific :p anyway, i've addressed the question with that in mind in the second part of my answer. – nickf Apr 14 '09 at 16:47
Writing my own parser is exactly the kind of thing I want to avoid here :) – Michał Tatarynowicz Apr 14 '09 at 17:01

score -2 · Answer 5 · edited May 23 '17 at 12:17

The universality criterion is impossible. Given the regular expression "^To be, or not to be -- that is the question:$", there will not be ten unique random strings that match.

For non-degenerate cases:

moonshadow's link to Perl's String::Random is the answer. A Perl program that reads a RegEx from stdin and writes the output from ten invocations of String::Random to stdout is trivial. Compile it to either a Windows or Unix exe with Perl2exe and invoke it from PHP, Python, or whatever.

Also see Random Text generator based on regex

edited May 23 '17 at 12:17

Community

answered Apr 14 '09 at 16:10

Thomas L Holaday

The example you gave is certainly not a degenerate case for me, but rather, a very easy one. The first character matches only 'T', second matches only 'o', and so forth. If all the versions are the same, so be it. – Michał Tatarynowicz Apr 14 '09 at 16:38
