49

I want to reverse a regular expression. I.e. given a regular expression, I want to produce any string that will match that regex.

I know how to do this from a theoretical computer science background using a finite state machine, but I just want to know if someone has already written a library to do this. :)

I'm using Python, so I'd like a Python library.

To reiterate, I only want one string that will match the regex. Things like "." or ".*" would make an infinite amount of strings match the regex, but I don't care about all options.

I'm willing for this library to only work on a certain subset of regex.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Amandasaurus
  • 58,203
  • 71
  • 188
  • 248
  • 1
    I assume you want one arbitrary string, and not all strings...right? Otherwise, the moment you include a *, you get an infinite list. What is it you want to do this for? It might be easier to answer, that way. – user54650 Jan 29 '09 at 18:08
  • **See also:** https://stackoverflow.com/questions/4627464/generate-a-string-that-matches-a-regex-in-python (duplicate question) – dreftymac Nov 11 '22 at 22:29

8 Answers8

28

Somebody else had a similar (duplicate?) question here, and I'd like to offer a little helper library for generating random strings with Python that I've been working on.

It includes a method, xeger() that allows you to create a string from a regex:

>>> import rstr
>>> rstr.xeger(r'[A-Z]\d[A-Z] \d[A-Z]\d')
u'M5R 2W4'

Right now, it works with most basic regular expressions, but I'm sure it could be improved.

bjmc
  • 2,970
  • 2
  • 32
  • 46
  • 1
    Do you have any plans to package this so it's available via pip? – Andy Hayden May 26 '13 at 11:04
  • I used this for [a library I'm working on](https://github.com/hayd/pyfaker/blob/master/pyfaker/xeger.py), I had to tweak a couple of things to get it working in py3. Thanks very much for this nice tool. – Andy Hayden May 27 '13 at 19:56
  • It's been released on PyPI now, so you should be able to install it with pip. If you don't mind making a pull request with your Py3 changes (are they backwards-compatible?) I'd love to include them. – bjmc Jun 06 '13 at 15:28
  • Andy, if you haven't already seen it, you should check out [python-faker](https://github.com/redneckbeard/python-faker) which is similar to your library (though not under active development as far as I can tell). – bjmc Jun 06 '13 at 15:30
  • 1
    Thanks, this [one](https://github.com/joke2k/faker) seems to be the best imo, but again not much development going on. Changes are backwards compatible, I wasn't sure how to make a pr in bitbucket, but I can give it a try later (can be seen [here](https://github.com/hayd/pyfaker/commit/098bd2b3356482a0d0c98dbc5b52c06762a4e9ef#pyfaker/xeger.py), actually you could keep using xrange and add `xrange=range` in the if block too). – Andy Hayden Jun 06 '13 at 15:39
  • [**xeger**](https://github.com/crdoconnor/xeger) is the better **rstr** this time. To me, the key limitation of rstr is a lack of limitation for the xeger output. – Wolf Sep 17 '20 at 10:35
  • Another option is [intxeger](https://github.com/k15z/IntXeger) which is much faster and provides some extra features such as array-like indexing. – Kevin Alex Zhang Mar 13 '21 at 03:19
19

Although I don't see much sense in this, here goes:

import re
import string

def traverse(tree):
    retval = ''
    for node in tree:
        if node[0] == 'any':
            retval += 'x'
        elif node[0] == 'at':
            pass
        elif node[0] in ['min_repeat', 'max_repeat']:
            retval += traverse(node[1][2]) * node[1][0]
        elif node[0] == 'in':
            if node[1][0][0] == 'negate':
                letters = list(string.ascii_letters)
                for part in node[1][1:]:
                    if part[0] == 'literal':
                        letters.remove(chr(part[1]))
                    else:
                        for letter in range(part[1][0], part[1][1]+1):
                            letters.remove(chr(letter))
                retval += letters[0]
            else:
                if node[1][0][0] == 'range':
                    retval += chr(node[1][0][1][0])
                else:
                    retval += chr(node[1][0][1])
        elif node[0] == 'not_literal':
            if node[1] == 120:
                retval += 'y'
            else:
                retval += 'x'
        elif node[0] == 'branch':
            retval += traverse(node[1][1][0])
        elif node[0] == 'subpattern':
            retval += traverse(node[1][1])
        elif node[0] == 'literal':
            retval += chr(node[1])
    return retval

print traverse(re.sre_parse.parse(regex).data)

I took everything from the Regular Expression Syntax up to groups -- this seems like a reasonable subset -- and I ignored some details, like line endings. Error handling, etc. is left as an exercise to the reader.

Of the 12 special characters in a regex, we can ignore 6 completely (2 even with the atom they apply to), 4.5 lead to a trivial replacement and 1.5 make us actually think.

What comes out of this is not too terribly interesting, I think.

  • 1
    How about 'assert' nodes? e.g. `r'(?=[a-z])(?=[x-z])'` – Vegard Mar 29 '14 at 21:15
  • I know the point wasn't to give a full solution. I was just asking if you could sketch the solution for 'assert' nodes as well. If you still feel I missed the point, could you please explain why my comment was so hopelessly off the mark? – Vegard Apr 01 '14 at 13:54
  • @Vegard: you miss the point that the exercise is absolutely pointless. for example: replace `(?=[a-z])` with `a`. done. not interesting. –  Apr 01 '14 at 17:13
  • Well, I think it's interesting because with one or more `(?=...)` you now need to find something that matches two or more regexes simultaneously. For the example I gave, you can only select letters in the range `[x-z]` since that's the overlapping range. So it's not as straightforward anymore. – Vegard Apr 01 '14 at 21:45
  • @Vegard: give me a real-life example where your regex makes any sense. –  Apr 01 '14 at 22:34
  • My real-life application is actually a translation project where a several documents are split into "texts" and "libraries"; the library contains strings and their translations, whereas the texts select strings from the library. To save typing, the texts can be patterns like ("I saw him ... the house." and it will select things like "I saw him enter the library." (and its translation) from the library). Now, this book in particular has many translations which are similar, so I'd like the library to contain entries like: "EN: I saw him enter the (?Phouse|shop)." and corresponding – Vegard Apr 02 '14 at 08:49
  • ...translation, e.g. "NO: Jeg så ham gå inn i (?Phuset|butikken).". Now matching a string in the text against a string in the library actually becomes a matter of finding a string that matches *both* the string from the text and the pattern in the library. To do this, you would have to find the intersection of the strings that match the text pattern and the library pattern. Which you can also do by finding solutions to a regex of the form `(?=text)(?=library)`. I tried to explain in the limited space I have here. It's a real-life application. Not saying it's necesssarily good. – Vegard Apr 02 '14 at 08:53
13

I don't know of any module to do this. If you don't find anything like this in the Cookbook or PyPI, you could try rolling your own, using the (undocumented) re.sre_parse module. This might help getting you started:

In [1]: import re

In [2]: a = re.sre_parse.parse("[abc]+[def]*\d?z")

In [3]: a
Out[3]: [('max_repeat', (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])), ('max_repeat', (0, 65535, [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])), ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])), ('literal', 122)]

In [4]: eval(str(a))
Out[4]: 
[('max_repeat',
  (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])),
 ('max_repeat',
  (0,
   65535,
   [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])),
 ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])),
 ('literal', 122)]

In [5]: a.dump()
max_repeat 1 65535
  in
    literal 97
    literal 98
    literal 99
max_repeat 0 65535
  in
    literal 100
    literal 101
    literal 102
max_repeat 0 1
  in
    category category_digit
literal 122
Hans Nowak
  • 7,600
  • 1
  • 18
  • 18
8

Exrex can create strings from regexes.

Exrex is a command line tool and python module that generates all - or random - matching strings to a given regular expression and more.

Example:

>>> exrex.getone('\d{4}-\d{4}-\d{4}-[0-9]{4}')
'3096-7886-2834-5671'
Sjoerd
  • 74,049
  • 16
  • 131
  • 175
5

While the other answers use the re engine to parse out the elements I have whipped up my own that parses the re and returns a minimal pattern that would match. (Note it doesn't handle [^ads], fancy grouping constructs, start/end of line special characters). I can supply the unit tests if you really like :)

import re
class REParser(object):
"""Parses an RE an gives the least greedy value that would match it"""

 def parse(self, parseInput):
    re.compile(parseInput) #try to parse to see if it is a valid RE
    retval = ""
    stack = list(parseInput)
    lastelement = ""
    while stack:
        element = stack.pop(0) #Read from front
        if element == "\\":
            element = stack.pop(0)
            element = element.replace("d", "0").replace("D", "a").replace("w", "a").replace("W", " ")
        elif element in ["?", "*"]:
            lastelement = ""
            element = ""
        elif element == ".":
            element = "a"
        elif element == "+":
            element = ""
        elif element == "{":
            arg = self._consumeTo(stack, "}")
            arg = arg[:-1] #dump the }     
            arg = arg.split(",")[0] #dump the possible ,
            lastelement = lastelement * int(arg)
            element = ""
        elif element == "[":
            element = self._consumeTo(stack, "]")[0] # just use the first char in set
            if element == "]": #this is the odd case of []<something>]
                self._consumeTo(stack, "]") # throw rest away and use ] as first element
        elif element == "|":
            break # you get to an | an you have all you need to match
        elif element == "(":
            arg = self._consumeTo(stack, ")")
            element = self.parse( arg[:-1] )

        retval += lastelement
        lastelement = element
    retval += lastelement #Complete the string with the last char

    return retval

 def _consumeTo(self, stackToConsume, endElement ):
    retval = ""
    while not retval.endswith(endElement):
        retval += stackToConsume.pop(0)
    return retval
Andrew Cox
  • 10,672
  • 3
  • 33
  • 38
4

Unless your regex is extremely simple (i.e. no stars or pluses), there will be infinitely many strings which match it. If your regex only involves concatenation and alternation, then you can expand each alternation into all of its possibilities, e.g. (foo|bar)(baz|quux) can be expanded into the list ['foobaz', 'fooquux', 'barbaz', 'barquux'].

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
3

The hypothesis package has a strategy from_regex which does exactly that:

>>> from hypothesis import strategies as st
>>> regex_strategy = st.from_regex(r'[0-9]{4}-[0-9]{8}-[0-9]{16}', fullmatch=True)
>>> regex_strategy.example()
'5000-00000000-0000000000000000'
>>> regex_strategy.example()
'8934-72779254-0542308893797380'
>>> regex_strategy.example()
'0000-00000000-0000000000000000'
>>> regex_strategy.example()
'9000-00000000-0000000000000000'
>>> regex_strategy.example()
'2791-03881838-5840296980736661'

Beware though that hypothesis is a library for fuzzy testing and not just a simple data generator. So caution is advised: if you e.g. modify the pattern in the above example to \d{4}-\d{8}-\d{16}, a generated example could be something like '۵෫൮۹-๕꯱౦໓᠘௮᭒৮-꯳೩꘥៦६༣߃੮8༡႑۹᪒٩꧶' which may seem unexpected yet matches the pattern.

hoefling
  • 59,418
  • 12
  • 147
  • 194
2

I haven't seen a Python module to do this, but I did see a (partial) implementation in Perl: Regexp::Genex. From the module description, it sounds like the implementation relies on internal details of Perl's regular expression engine, so it may not be useful even from a theoretical point of view (I haven't investigated the implementation, just going by the comments in the documentation).

I think doing what you propose in general is a hard problem and may require the use of nondeterministic programming techniques. A start would be to parse the regular expression and build a parse tree, then traverse the tree and build sample string(s) as you go. Challenging bits will probably be things like backreferences and avoiding infinite loops in your implementation.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • I've come across genex in my searches and it's the sorta thing I want to do. I'm happy to limit my regexes to only a subset of regexes. – Amandasaurus Jan 29 '09 at 18:20
  • *`build a parse tree, then traverse the tree and build sample string(s) as you go`* interesting POV, really. What I thought: is this really just "traversal" or more like "enter, then try to escape"? – Wolf May 02 '16 at 10:59