Reverse regular expression in Python

Question

This is a strange question, I know... I have a regular expression like:

rex = r"at (?P<hour>[0-2][0-9]) send email to (?P<name>\w*):? (?P<message>.+)"

so if I match that like this:

match = re.match(rex, "at 10 send email to bob: hi bob!")

match.groupdict() gives me this dict:

{"hour": "10", "name": "bob", "message": "hi bob!"}

My question is: given the dict above and rex, can I write a function that returns the original text? I know that many texts can map to the same dict (in this case the ':' after the name is optional), but I only want one of the infinitely many texts that would produce the input dict.

Is it really infinite? Other than the optional ':', everything else is fixed right? — Jayanth Koushik, Apr 23 '14 at 09:35
Short of using `match.group()` (a.k.a. `match.group(0)`), no. You're discarding information (in particular, whether the original string contained a colon or not), so there's no way of definitively reconstructing the original string just from the contents of the captured groups. The only way is to add a capture group for the colon, which you can then use to determine whether the input text contained a colon or not. — Mac, Apr 23 '14 at 09:35
I gave an incorrect answer... The point is that the regex loses some data, so if you want to restore the input you need to capture all of it (in different tokens). – Emilien, Apr 23 '14 at 09:42
@JayanthKoushik Yes, it is infinite, because the pattern could contain " +" between two words, so every sentence with one or more spaces there would match. – Matteo, Apr 23 '14 at 09:49
@Emilien ok I understand but I'm happy just to have one sentence (in the case above with or without ':' is the same) — Matteo, Apr 23 '14 at 09:51
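
The suggestion above of adding a capture group for the colon can be sketched as follows. The group name `colon` and the variable names are assumptions made for illustration; the rest of the pattern is the one from the question. Once every variable part of the text is captured, the dict holds enough information to rebuild the input exactly (as long as the fixed parts, such as the single spaces, stay fixed).

```python
import re

# Same pattern as in the question, but the optional colon is now captured
# in a group of its own ("colon" is a name chosen for this sketch).
rex = r"at (?P<hour>[0-2][0-9]) send email to (?P<name>\w*)(?P<colon>:?) (?P<message>.+)"

with_colon = "at 10 send email to bob: hi bob!"
without_colon = "at 10 send email to bob hi bob!"

def rebuild(text):
    """Parse text with rex, then rebuild it from the captured groups only."""
    data = re.match(rex, text).groupdict()
    return "at {hour} send email to {name}{colon} {message}".format(**data)

rebuilt_with = rebuild(with_colon)
rebuilt_without = rebuild(without_colon)
```

Both inputs round-trip unchanged here, because the only information the original pattern discarded was the presence of the colon.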

unutbu · Answer 1 · 2014-04-24T11:14:34.990

Using inverse_regex:

"""
http://www.mail-archive.com/python-list@python.org/msg125198.html
"""
import itertools as IT
import sre_constants as sc
import sre_parse
import string

# Generate strings that match a given regex

category_chars = {
    sc.CATEGORY_DIGIT : string.digits,
    sc.CATEGORY_SPACE : string.whitespace,
    sc.CATEGORY_WORD  : string.digits + string.ascii_letters + '_'
    }

def unique_extend(res_list, list):
    for item in list:
        if item not in res_list:
            res_list.append(item)

def handle_any(val):
    """
    This is different from normal regexp matching. It only matches
    printable ASCII characters.
    """
    return string.printable

def handle_branch(val):
    # val is a (tok, branches) pair; only the list of branches is needed.
    _, branches = val
    all_opts = []
    for toks in branches:
        opts = permute_toks(toks)
        unique_extend(all_opts, opts)
    return all_opts

def handle_category(val):
    return list(category_chars[val])

def handle_in(val):
    out = []
    for tok, val in val:
        out += handle_tok(tok, val)
    return out

def handle_literal(val):
    return [chr(val)]

def handle_max_repeat(val):
    """
    Handle a repeat token such as {x,y} or ?.
    """
    min_count, max_count, subpattern = val
    subtok, subval = subpattern[0]

    if max_count > 5000:
        # max_count is the number of cartesian join operations needed to be
        # carried out. More than 5000 consumes way too much memory.
        # raise ValueError("Too many repetitions requested (%d)" % max_count)
        max_count = 5000

    optlist = handle_tok(subtok, subval)

    iterlist = []
    for x in range(min_count, max_count + 1):
        joined = IT.product(*[optlist]*x)
        iterlist.append(joined)

    return (''.join(it) for it in IT.chain(*iterlist))

def handle_range(val):
    lo, hi = val
    return (chr(x) for x in range(lo, hi + 1))

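def handle_subpattern(val):
    # The parsed sub-pattern is the last element of val: (group, p) on
    # Python 2, (group, add_flags, del_flags, p) on Python 3.6+.
    return list(permute_toks(val[-1]))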

def handle_tok(tok, val):
    """
    Returns a list of strings of possible permutations for this regexp
    token.
    """
    handlers = {
        sc.ANY        : handle_any,
        sc.BRANCH     : handle_branch,
        sc.CATEGORY   : handle_category,
        sc.LITERAL    : handle_literal,
        sc.IN         : handle_in,
        sc.MAX_REPEAT : handle_max_repeat,
        sc.RANGE      : handle_range,
        sc.SUBPATTERN : handle_subpattern}
    try:
        return handlers[tok](val)
    except KeyError:
        fmt = "Unsupported regular expression construct: %s"
        raise ValueError(fmt % tok)

def permute_toks(toks):
    """
    Returns a generator of strings of possible permutations for this
    regexp token list.
    """
    lists = [handle_tok(tok, val) for tok, val in toks]
    return (''.join(it) for it in IT.product(*lists))



########## PUBLIC API ####################

def ipermute(p):
    return permute_toks(sre_parse.parse(p))

You could apply the substitutions given rex and data, and then use inverse_regex.ipermute to generate strings that match the original regex:

import re
import itertools as IT
import inverse_regex as ire

rex = r"(?:at (?P<hour>[0-2][0-9])|today) send email to (?P<name>\w*):? (?P<message>.+)"
match = re.match(rex, "at 10 send email to bob: hi bob!")
data = match.groupdict()
del match

# re.escape keeps regex metacharacters in the captured values literal.
new_regex = re.sub(r'[(][?]P<([^>]+)>[^)]*[)]', lambda m: re.escape(data[m.group(1)]), rex)
for s in IT.islice(ire.ipermute(new_regex), 10):
    print(s)

yields

today send email to bob hi bob!
today send email to bob: hi bob!
at 10 send email to bob hi bob!
at 10 send email to bob: hi bob!

Note: I modified the original inverse_regex to not raise a ValueError when the regex contains *s. Instead, the * is changed to be effectively like {,5000} so you'll at least get some permutations.

Thank you very much, this is what I was looking for! But I have one question: this doesn't handle nested brackets, for example (?:at (?P<hour>[0-2][0-9])|today). Is there a solution? – Matteo, Apr 23 '14 at 12:35
That happens because the regex "[(][?]P<([^>]+)>[^)]*[)]" stops at the first ')' it finds, even when there is a nested '(?...' group in between. Is there a way to skip as many ')' as there are '(' in the middle? – Matteo, Apr 23 '14 at 12:50
That's a really interesting problem. You might be able to deal with the nested parentheses [using (or modifying) this](http://stackoverflow.com/a/23185606/190597), but I don't have a ready answer. As a form of play I enjoy problems like this, but from a practical point of view, I think perhaps you might be pursuing an [XY Problem](http://meta.stackexchange.com/q/66377/137631) -- if you asked a question about your larger goal, someone here may be able to suggest a strategy that avoids this complication. — unutbu, Apr 23 '14 at 14:13
I read about the XY problem and this is not one. I explained it in a comment on the other answer, which you may not have read: I would like to store in ONE concise variable the information needed by two functions (encode and decode) that translate the string into a dict and the dict back into a string. The regular expression works perfectly for "encode", because together with the string it produces the dict, but what about decode? Thank you anyway; if you have any other ideas, please tell me! :) – Matteo, Apr 23 '14 at 17:47
What is the advantage of storing the regex and the dict instead of the regex and the string? Are you trying to save space? or is there some other purpose? — unutbu, Apr 23 '14 at 18:28
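
One possible way to handle the nested parentheses raised in the comments above is to stop using a regex for the substitution and scan for the matching closing parenthesis by hand. The sketch below is an assumption-laden illustration, not part of inverse_regex: the function name `substitute_named_groups` is hypothetical, it skips backslash escapes and ordinary character classes, and it does not handle a class that starts with a literal `]` (such as `[]a]`) or a malformed pattern.

```python
import re

GROUP_START = re.compile(r"\(\?P<([^>]+)>")

def substitute_named_groups(pattern, data):
    """Replace each (?P<name>...) group, nested parentheses included,
    with the escaped value data[name]. Groups without a value are kept."""
    out = []
    i = 0
    while i < len(pattern):
        m = GROUP_START.match(pattern, i)
        if m is None or data.get(m.group(1)) is None:
            # Copy the text through unchanged; keep escape pairs together
            # so that an escaped parenthesis is never mistaken for a group.
            step = 2 if pattern[i] == "\\" else 1
            out.append(pattern[i:i + step])
            i += step
            continue
        # Walk forward to the parenthesis that closes this named group.
        depth, j, in_class = 1, m.end(), False
        while depth:
            c = pattern[j]
            if c == "\\":
                j += 2          # skip the escaped character
                continue
            if in_class:
                in_class = c != "]"
            elif c == "[":
                in_class = True
            elif c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
            j += 1
        out.append(re.escape(data[m.group(1)]))
        i = j
    return "".join(out)

# "what" is a made-up group containing nested groups, to exercise the scan.
rex = r"(?:at (?P<hour>[0-2][0-9])|today) send (?P<what>(?:e)?mail(?:s)?) to (?P<name>\w*):?"
data = {"hour": "10", "what": "email", "name": "bob"}
new_regex = substitute_named_groups(rex, data)
```

The resulting pattern contains no named groups and could then be passed to `ipermute` in place of the `re.sub` result.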

score 0 · Answer 2 · answered Apr 23 '14 at 09:33

This is one of the texts that will match the regex:

'at {hour} send email to {name}: {message}'.format(**match.groupdict())
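
As a runnable check of the line above, here is a minimal sketch that uses the regex from the question; the variable names are chosen for illustration. It rebuilds a text from the dict and confirms that the text parses back to the same dict, even though it is not identical to the input. When the match object is still available, the standard `Match.expand` method gives the same string.

```python
import re

rex = r"at (?P<hour>[0-2][0-9]) send email to (?P<name>\w*):? (?P<message>.+)"

# The input deliberately has no colon after the name.
original = "at 10 send email to bob hi bob!"
match = re.match(rex, original)
data = match.groupdict()

# Rebuild one matching text from the dict alone.
text = 'at {hour} send email to {name}: {message}'.format(**data)

# Same result through Match.expand, which needs the match object itself.
expanded = match.expand(r'at \g<hour> send email to \g<name>: \g<message>')

# The rebuilt text adds the colon, yet it maps back to the same dict.
roundtrip = re.match(rex, text).groupdict()
```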

Jayanth Koushik

Or, more idiomatically, `match.expand(r'at \g<hour> send email to \g<name>: \g<message>')` – Mac Apr 23 '14 at 09:39
`\g` not `\\`, but yes, that is more idiomatic. – Jayanth Koushik Apr 23 '14 at 09:42
But OP says 'using the groupdict'. If the match was available, you could just do `match.group()` – Jayanth Koushik Apr 23 '14 at 09:43
@JayanthKoushik Because all I have is the regular expression, I would need to translate rex into the string you mentioned above: 'at {hour} send email to {name}: {message}'. – Matteo Apr 23 '14 at 10:00
You mean without even using the groupdict? – Jayanth Koushik Apr 23 '14 at 10:01
To explain my problem better: I would like to store in ONE concise variable the information needed by two functions (encode and decode) that translate the string into a dict and the dict back into a string. The regular expression works perfectly for "encode", because together with the string it produces the dict, but what about decode? – Matteo Apr 23 '14 at 10:03
As the others have said: not possible, since you are discarding information. – Jayanth Koushik Apr 23 '14 at 10:04
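
If the goal is a single definition that drives both directions, one option is to go the other way round: store a format template and derive the regex from it, rather than trying to invert the regex. The sketch below is only an illustration under stated assumptions: `TEMPLATE`, `FIELD_PATTERNS`, `template_to_regex`, `to_dict` and `to_text` are hypothetical names, and the template fixes the colon and the single spaces, so it accepts fewer inputs than the original rex.

```python
import re
from string import Formatter

# Single source of truth (hypothetical names): the template gives the fixed
# text, and the mapping gives the sub-pattern for each field.
TEMPLATE = "at {hour} send email to {name}: {message}"
FIELD_PATTERNS = {"hour": r"[0-2][0-9]", "name": r"\w*", "message": r".+"}

def template_to_regex(template, field_patterns):
    """Build a regex with one named group per template field."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        parts.append(re.escape(literal))      # fixed text must match literally
        if field is not None:
            parts.append("(?P<%s>%s)" % (field, field_patterns[field]))
    return "".join(parts)

def to_dict(text):
    """String -> dict, using the regex derived from the template."""
    m = re.match(template_to_regex(TEMPLATE, FIELD_PATTERNS), text)
    return m.groupdict() if m else None

def to_text(data):
    """Dict -> string, using the template directly."""
    return TEMPLATE.format(**data)

parsed = to_dict("at 10 send email to bob: hi bob!")
rebuilt = to_text(parsed)
```

Because the fixed text lives in the template, nothing is discarded between the two directions, which is the property the plain regex lacks.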
