convert string to regex pattern

Question

I want to derive a regular expression pattern from a character string. My goal is to reuse this pattern to find strings with the same structure in another context.

From the string "1example4whatitry2do", I want to derive a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}

I can then reuse this pattern to find another string such as 2eytmpxe8wsdtmdry1uo

I can loop over each character, but I hope there is a faster way.

Thanks for your help!

so you want to be able to generate a regex pattern from a single instance of what you're trying to match, then use that regex to search for additional matches? — Jethro Cao, Oct 12 '19 at 07:43
and it can be assumed the strings will only contain alphanumeric characters? — Jethro Cao, Oct 12 '19 at 08:36
Yes, I'm cleaning the text beforehand (removing extra characters) and converting it to lowercase. But if you have a solution that manages that too :) — Manu64, Oct 12 '19 at 09:00
What have you tried so far? What specifically can we help you with? — MisterMiyagi, Oct 13 '19 at 08:06
I had tried to make loops but I didn't find the process very practical because in each case I had to specify the type of character I was looking for (alpha, numeric, special characters). Indeed, I should have specified all that, which is worth a -1. — Manu64, Oct 20 '19 at 15:16

Patrick Artner · Accepted Answer · 2019-10-13T08:17:22.387

You can puzzle this out.

Go over your string character by character:
- if the character is a lowercase letter, add a 't' marker to a list
- if the character is a digit, add a 'd' marker to a list
- if the character is something else, add the character itself to the list

Use itertools.groupby to group consecutive identical list entries. Then build the pattern from each group's key and the group's length using f-string formatting.
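As a minimal sketch of the grouping step on its own: the `labels` list below is a made-up example input (not taken from the question), and it only shows how groupby turns runs of equal markers into (marker, run length) pairs.

```python
from itertools import groupby

# Hypothetical marker list: 'd' stands for a digit, 't' for a lowercase letter.
labels = ["d", "t", "t", "t", "d", "d"]

# groupby yields one (key, iterator) pair per run of equal consecutive items;
# len(list(group)) consumes the iterator to count the run.
runs = [(key, len(list(group))) for key, group in groupby(labels)]

print(runs)  # [('d', 1), ('t', 3), ('d', 2)]
```

Note that groupby only merges neighbours, so the two separate runs of 'd' stay separate, which is what a positional pattern needs.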

Code:

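import re
from itertools import groupby
from string import ascii_lowercase

lower_case = set(ascii_lowercase)  # set for faster lookup

def find_regex(p):
    # Map each character to a token: \d for digits, \t as a placeholder
    # for lowercase letters, anything else as an escaped literal.
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("\\d")
        elif c in lower_case:
            cum.append("\\t")
        else:
            cum.append(re.escape(c))  # safe for '.', uppercase letters, etc.

    # Collapse runs of identical tokens into token{count}; omit {1}.
    runs = ((what, len(list(group))) for what, group in groupby(cum))
    return ''.join(f'{what}{{{how_many}}}' if how_many > 1 else what
                   for what, how_many in runs)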

pattern = "1example4...whatit.ry2do"

print(find_regex(pattern))

Output:

\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}

The conditional expression in the formatting omits the redundant {1} for runs of length one.

See:

str.isdigit()

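If you now replace '\t' with '[a-z]', your regex should fit ('\t' is only a placeholder here; in a real regex it would match a tab). Instead of the isdigit check, you could also match each character against the regex r'\d' or test c in set(string.digits).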

pattern = "1example4...whatit.ry2do"

pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
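To close the loop on the original goal, here is a self-contained sketch that derives a pattern from the first sample string in the question and checks the second one against it. The helper name `build_pattern` is hypothetical; it is a variant of the approach, not the code above. It assumes you want `[0-9]` and `[a-z]` emitted directly and every other character matched literally via re.escape.

```python
import re
from itertools import groupby
from string import ascii_lowercase, digits

def build_pattern(sample):
    """Hypothetical helper: derive a regex from a single sample string."""
    def char_class(c):
        if c in digits:
            return "[0-9]"
        if c in ascii_lowercase:
            return "[a-z]"
        return re.escape(c)  # assumption: other characters must match literally

    parts = []
    # Group consecutive characters that fall into the same class.
    for cls, run in groupby(map(char_class, sample)):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)

pat = build_pattern("1example4whatitry2do")
print(pat)  # [0-9][a-z]{7}[0-9][a-z]{8}[0-9][a-z]{2}

# fullmatch requires the whole candidate string to fit the pattern.
print(bool(re.fullmatch(pat, "2eytmpxe8wsdtmdry1uo")))  # True
```

Use re.search instead of re.fullmatch if the string you are looking for is embedded in longer text.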

See

string module for ascii_lowercase and digits

@Manu I fixed an error: I had somehow omitted the step of converting `'\t'` to a character range, and I now use a set for the character comparison, which makes it faster. If it works, see: https://stackoverflow.com/help/someone-answers and https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work — Patrick Artner, Oct 13 '19 at 08:08
