Detect or Generate Regular Expression from String

Question

I was wondering whether there are any Python packages that detect (that is, infer) a regular expression from a string. Conceptually this is easy enough to do, but I wanted to see whether anyone else has already solved this problem.

To the extent that I looked around on my own: I've read the re package docs and didn't find it, and I read the best hits I could find on Stack Overflow and couldn't find one either. I've googled it, and the hits I find are about how to use regex to parse strings. I've searched through PyPI, but the only hit I could find is 'regexgen 1.0' and that didn't seem to lead anywhere...

To be clear, what I am looking for is something to the effect of:

def detect_regex(some_string):
    [does stuff..]
    return regular_expression

Thoughts would be greatly appreciated. If there aren't any, I can write this myself. I just didn't want to waste time re-creating what's already been done. Thanks!

Edit: I may not have been very clear in my question; the regex 'foo' does match the string 'foo' but my goal is the following -- given the three strings with values foo123abc, abc078963bar, and xyz8940baz, the return would be ^[a-z]+[0-9]+[a-z]+$... hypothetically. So the regex would be "general" to some extent.

Every string is a valid regular expression (`foo` is a regular expression that matches the string "foo"), so you'll need to provide some example of what you consider a string with an embedded regular expression. — chepner, Jul 06 '15 at 19:38
@Chepner, wouldn't a string like `'((('` be an invalid regular expression? — TigerhawkT3, Jul 06 '15 at 19:40
@Chepner I may not have been very clear in my question. You are absolutely right, the regex foo matches the string foo. My goal is the following: given the three strings with values `foo123abc`, `abc078963bar`, and `xyz8940baz`, the return would be `^[a-z]+[0-9]+[a-z]+$`... hypothetically — fpes, Jul 06 '15 at 19:41
The linked question is not a duplicate of this question (whose answer, I think, is "no", because there are too many possible regular expressions that would match any given set of strings). — chepner, Jul 06 '15 at 19:56
(Oh, it turns out I can reopen a question on my own, rather than vote to reopen it.) — chepner, Jul 06 '15 at 19:56
Are you not asking how to detect a valid regular expression, but rather how to automatically create a regular expression that matches a small test sample? I can tell you right now that that will not work. — TigerhawkT3, Jul 06 '15 at 20:10
@TigerhawkT3 yes, that's what I was hypothetically thinking and based on everyone's opinions, it seems an effort in vain. — fpes, Jul 06 '15 at 20:12
Sorry; SO didn't dynamically update this question once it was reopened. But yes, what you are asking is impossible. It would be like reverse-engineering the contents of a book based on its reviews, word for word. — TigerhawkT3, Jul 06 '15 at 20:15
Not possible practically. It is the equivalent of saying `how many formulas have the answer '42'?` Well, many many many formulas do. — dawg, Jul 06 '15 at 21:39
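
The objection raised in these comments can be made concrete. What follows is a minimal sketch, not something posted in the thread: the `candidates` list and the `matches_all` helper are hypothetical names chosen for illustration. Every pattern in the list, from very loose to very strict, matches all three sample strings, so the samples alone do not single out one "correct" regex.

```python
import re

samples = ["foo123abc", "abc078963bar", "xyz8940baz"]

# Hypothetical candidate patterns, ordered from most general to most specific.
candidates = [
    r"^.*$",
    r"^[a-z0-9]+$",
    r"^[a-z]+[0-9]+[a-z]+$",
    r"^[a-z]{3}[0-9]{3,6}[a-z]{3}$",
    r"^(foo123abc|abc078963bar|xyz8940baz)$",
]


def matches_all(pattern, strings):
    """Return True if the pattern matches every string in full."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


# Keep only the candidates that are consistent with every sample.
consistent = [p for p in candidates if matches_all(p, samples)]

for p in consistent:
    print(p)
```

All five candidates survive the filter, which is why some rule for choosing the level of generality has to be supplied in addition to the examples.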

Martin Evans · Answer 1 · 2021-05-08T12:03:21.547

Very limited, but a bit of fun. Only deals with alpha and numerics as per your examples. Was also just curious and hopefully this helps to give you a suitable starting point:

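def auto_re(text):
    # True if the string starts with a letter; cur tracks the class of the current run.
    cur = first = text[0].isalpha()
    count = 0  # number of switches between letter runs and digit runs

    for letter in text:
        if letter.isalpha() != cur:
            cur = not cur
            count += 1

    # Indexed by bool: modes[False] is the digit class, modes[True] the letter class.
    modes = ["[0-9]+", "[a-z]+"]
    # Integer division (//) so the repeat count stays an int on Python 3 as well.
    return ("^" + modes[first]
            + (count // 2) * (modes[not first] + modes[first])
            + (count % 2) * modes[not first] + "$")


tests = ["foo123abc", "abc078963bar", "xyz8940baz", "1a", "a1", "1a2b3c", "a12b3c"]

for test in tests:
    print("%-20s %s" % (test, auto_re(test)))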

Gives the following output:

foo123abc            ^[a-z]+[0-9]+[a-z]+$
abc078963bar         ^[a-z]+[0-9]+[a-z]+$
xyz8940baz           ^[a-z]+[0-9]+[a-z]+$
1a                   ^[0-9]+[a-z]+$
a1                   ^[a-z]+[0-9]+$
1a2b3c               ^[0-9]+[a-z]+[0-9]+[a-z]+[0-9]+[a-z]+$
a12b3c               ^[a-z]+[0-9]+[a-z]+[0-9]+[a-z]+$

Thanks for this. This is what I had in mind. Something that becomes clearer is that the rules of regex generation need to be specified. In your example, that regex should be built with just characters and strings. This is what I was planning to do, and thanks for the head start! I'm not marking this as the correct answer, though, given the discussion in the comments. Let me know if you think otherwise. — fpes, Jul 07 '15 at 02:43
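
To illustrate what "specifying the rules" could look like, here is a minimal Python 3 sketch under one assumed rule: collapse each run of same-class characters into that class followed by `+`. The names `char_class`, `infer_regex` and `common_regex` are hypothetical and do not come from any existing package; only `itertools.groupby` and the standard `re` module are used. It also covers the multi-string case from the question by returning a pattern only when every input generalizes to the same one.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a regex class (an assumed, deliberately small rule set)."""
    if ch.isascii() and ch.isdigit():
        return "[0-9]"
    if ch.isascii() and ch.islower():
        return "[a-z]"
    if ch.isascii() and ch.isupper():
        return "[A-Z]"
    # Anything else is kept as an escaped literal.
    return re.escape(ch)


def infer_regex(text):
    """Collapse each run of same-class characters into 'class+'."""
    classes = (char_class(ch) for ch in text)
    return "^" + "".join(cls + "+" for cls, _ in groupby(classes)) + "$"


def common_regex(strings):
    """Return the shared pattern if all strings generalize identically, else None."""
    patterns = {infer_regex(s) for s in strings}
    return patterns.pop() if len(patterns) == 1 else None


print(common_regex(["foo123abc", "abc078963bar", "xyz8940baz"]))
# -> ^[a-z]+[0-9]+[a-z]+$
print(common_regex(["foo123", "123foo"]))
# -> None
```

A different rule set (for example, fixed repeat counts or merged letter-and-digit classes) would give a different answer for the same inputs, which is the ambiguity pointed out in the comments on the question.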
