7

For testing purposes on a project I'm working on, I have a need to, if given a regular expression, randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:

^[abcd]d+

Then I should be able to generate strings such as:

hnbbad
uduebbaef
9f8;djfew
skjcc98332f

...each of which does NOT match the regex, but NOT generate:

addr32
bdfd09usdj
cdddddd-9fdssee

...each of which DO. In other words, I want something like an anti-Xeger.

Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.

SaidbakR
  • 13,303
  • 20
  • 101
  • 195
CaptainSpam
  • 127
  • 6
  • possible duplicate of [Reversing a regular expression in python](http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python) – Bart Kiers Nov 09 '12 at 22:04

4 Answers4

6

My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.

For example, proving whether a number is prime is believed to be a hard to solve mathematical problem. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:

(?!(..+)\1+$).{10000}

I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.

I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.


But having said that if you have a magical library that generates text that matches for any arbitrary regular expression then all you need to do is generate a regular expression that matches all the strings that don't match your original expression.

Luckily this is possible using a negative lookahead:

^(?![\s\S]*(?:^[abcd]d+))

If you are willing to change the requirements to only allow a limited subset of regular expressions then you can negate the regular expression by using boolean logic. For example if ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. It is then possible to find a valid input for this regular expression in reasonable time.

Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • Hm, yeah, I was somewhat afraid of that, really. But, ultimately, in this case, I can probably assure that the regexes we'll be supplied will be simple enough to negate by boolean logic, so that idea should help. Thanks anyway! – CaptainSpam Nov 10 '12 at 23:08
  • 1
    @MarkByers: It's always impressive to see a master regexer at work. I love this -> _The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:_ – Mike Sep 27 '17 at 23:44
3

I would do a loop, generating random combinations of random length, and test if matches the regexp. Repeat the loop until a not-match situation is reached.

Obviously, this would be inefficient. Are you sure you cannot invert the regexp and generate a match on the inverted regexp?

felixgaal
  • 2,403
  • 15
  • 24
  • Actually, depending on the specificity of the regular expression, you may be quite likely to find a non-match in a very short time. For example, if you're trying to generate random text that is NOT an email, you probably have a good shot of getting it right on the first go. Nice solution! – morphatic Apr 06 '17 at 01:59
2

No this is impossible. There are an infinite number of regexes that match every string in the known universe. For example:

/^/
/.*/
/[^"\\]*(\\.[^"\\]*)*$/

etc.

This is because all these regexes can match nothing at all (which is something all strings have!)

ridgerunner
  • 33,777
  • 5
  • 57
  • 69
0

Can we reduce the infinite number of possibilities, by restricting to generate strings from a give character set.

For example, I can define the character set, [QWERTYUIOP!@#$%^%^&*))_] and all the strings I generate randomly should be born from this set. That way we can reduce the infinite nature of this problem?

In fact even I am looking for a utility like this, preferably in Python.

jnovack
  • 7,629
  • 2
  • 26
  • 40