Randomly generate a string that does NOT match a given regular expression

Question

For testing purposes on a project I'm working on, I have a need to, if given a regular expression, randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:

^[abcd]d+

Then I should be able to generate strings such as:

hnbbad
uduebbaef
9f8;djfew
skjcc98332f

...each of which does NOT match the regex, but NOT generate:

addr32
bdfd09usdj
cdddddd-9fdssee

...each of which DO. In other words, I want something like an anti-Xeger.

Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.

possible duplicate of [Reversing a regular expression in python](http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python) — Bart Kiers, Nov 09 '12 at 22:04

Mark Byers · Accepted Answer · 2012-11-09T22:42:22.677

My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.

For example, proving whether a number is prime is believed to be a hard to solve mathematical problem. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:

(?!(..+)\1+$).{10000}

I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.

I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.

But having said that if you have a magical library that generates text that matches for any arbitrary regular expression then all you need to do is generate a regular expression that matches all the strings that don't match your original expression.

Luckily this is possible using a negative lookahead:

^(?![\s\S]*(?:^[abcd]d+))

If you are willing to change the requirements to only allow a limited subset of regular expressions then you can negate the regular expression by using boolean logic. For example if ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. It is then possible to find a valid input for this regular expression in reasonable time.

Hm, yeah, I was somewhat afraid of that, really. But, ultimately, in this case, I can probably assure that the regexes we'll be supplied will be simple enough to negate by boolean logic, so that idea should help. Thanks anyway! — CaptainSpam, Nov 10 '12 at 23:08
@MarkByers: It's always impressive to see a master regexer at work. I love this -> _The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:_ — Mike, Sep 27 '17 at 23:44

score 3 · Answer 2 · answered Nov 09 '12 at 22:06

3

I would do a loop, generating random combinations of random length, and test if matches the regexp. Repeat the loop until a not-match situation is reached.

Obviously, this would be inefficient. Are you sure you cannot invert the regexp and generate a match on the inverted regexp?

answered Nov 09 '12 at 22:06

felixgaal

2,403
15
24

Actually, depending on the specificity of the regular expression, you may be quite likely to find a non-match in a very short time. For example, if you're trying to generate random text that is NOT an email, you probably have a good shot of getting it right on the first go. Nice solution! – morphatic Apr 06 '17 at 01:59

score 2 · Answer 3 · answered Nov 10 '12 at 06:28

No this is impossible. There are an infinite number of regexes that match every string in the known universe. For example:

/^/
/.*/
/[^"\\]*(\\.[^"\\]*)*$/

etc.

This is because all these regexes can match nothing at all (which is something all strings have!)

score 0 · Answer 4 · edited Apr 06 '13 at 14:50

0

Can we reduce the infinite number of possibilities, by restricting to generate strings from a give character set.

For example, I can define the character set, [QWERTYUIOP!@#$%^%^&*))_] and all the strings I generate randomly should be born from this set. That way we can reduce the infinite nature of this problem?

In fact even I am looking for a utility like this, preferably in Python.

edited Apr 06 '13 at 14:50

jnovack

7,629
2
26
40

answered Apr 06 '13 at 14:27

Priya Ananthasankar

1

Randomly generate a string that does NOT match a given regular expression

4 Answers4