
I want to derive a regular expression pattern from a sample string. My goal is to reuse this pattern to find other strings with the same structure in a different context.

From the string "1example4whatitry2do", I want to obtain a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}

Then I can reuse this pattern to find another string with the same structure, for example 2eytmpxe8wsdtmdry1uo
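As a quick check, the hand-written pattern above does fit both example strings. A minimal sketch using `re.fullmatch` (the variable names are only illustrative):

```python
import re

# The pattern written out by hand in the question
hand_pattern = r"[0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}"

first = "1example4whatitry2do"
second = "2eytmpxe8wsdtmdry1uo"

# fullmatch succeeds only if the whole string fits the pattern
first_ok = re.fullmatch(hand_pattern, first) is not None
second_ok = re.fullmatch(hand_pattern, second) is not None
print(first_ok, second_ok)
```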

I could loop over each character, but I hope there is a faster way.

Thanks for your help!

anubhava
Manu64
  • so you want to be able to generate a regex pattern from a single instance of what you're trying to match, then use that regex to search for additional matches? – Jethro Cao Oct 12 '19 at 07:43
  • Yes, you're right – Manu64 Oct 12 '19 at 08:22
  • and it can be assumed the strings will only contain alphanumeric characters? – Jethro Cao Oct 12 '19 at 08:36
  • Yes, I clean the text beforehand (remove extra characters) and convert it to lowercase. But if you have a solution that manages that too :) – Manu64 Oct 12 '19 at 09:00
  • What have you tried so far? What specifically can we help you with? – MisterMiyagi Oct 13 '19 at 08:06
  • I had tried to make loops but I didn't find the process very practical because in each case I had to specify the type of character I was looking for (alpha, numeric, special characters). Indeed, I should have specified all that, which is worth a -1. – Manu64 Oct 20 '19 at 15:16

1 Answer


You can puzzle this out:

  • Go over your string character by character:
    • if the character is a letter, add a 't' to a list
    • if the character is a digit, add a 'd' to a list
    • if the character is something else, add the character itself to the list

Use itertools.groupby to group consecutive identical markers into runs. Create the pattern from each group key and the length of its run, using f-string formatting.
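To see what the grouping step produces on its own, here is a small sketch with a made-up marker list (the list contents are only an illustration):

```python
from itertools import groupby

markers = ["d", "t", "t", "t", "d", ".", "."]

# groupby yields one (key, iterator) pair per run of equal consecutive items
runs = [(key, len(list(group))) for key, group in groupby(markers)]
print(runs)  # [('d', 1), ('t', 3), ('d', 1), ('.', 2)]
```

Each `(key, length)` pair then becomes one piece of the final pattern.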

Code:

from itertools import groupby
from string import ascii_lowercase

lower_case = set(ascii_lowercase) # set for faster lookup

def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)

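    # groupby yields one (marker, run) pair per run of identical markers
    parts = []
    for what, run in groupby(cum):
        how_many = len(list(run))
        # omit the quantifier for runs of length 1
        parts.append(f'\\{what}{{{how_many}}}' if how_many > 1 else f'\\{what}')
    return ''.join(parts)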

pattern = "1example4...whatit.ry2do"

print(find_regex(pattern))

Output:

\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}

The conditional expression in the formatting omits the unnecessary {1} from the pattern.


If you now replace '\t' with '[a-z]', your regex should fit (left as is, \t would match a tab character). You could also replace the isdigit check with a regex r'\d' or with c in set(string.digits).

pattern = "1example4...whatit.ry2do"

pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
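A possible variant, sketched under a few assumptions: it emits the character classes directly instead of going through the `\t` placeholder, and it uses `re.escape` for any other character so that no accidental escape sequence is produced. The names `char_class` and `build_pattern` are hypothetical, not part of the code above. The sketch also shows the generated pattern being reused on the second string from the question.

```python
import re
from itertools import groupby

def char_class(c):
    # Map one character to the regex fragment that should stand for it
    if c.isdigit():
        return r"\d"
    if "a" <= c <= "z":
        return "[a-z]"
    return re.escape(c)  # literal character, escaped only if needed

def build_pattern(sample):
    parts = []
    # group consecutive identical fragments and count the length of each run
    for fragment, run in groupby(char_class(c) for c in sample):
        n = len(list(run))
        parts.append(fragment if n == 1 else f"{fragment}{{{n}}}")
    return "".join(parts)

pat = build_pattern("1example4whatitry2do")
print(pat)  # \d[a-z]{7}\d[a-z]{8}\d[a-z]{2}

# Reuse the generated pattern on other strings
print(bool(re.fullmatch(pat, "2eytmpxe8wsdtmdry1uo")))  # True
print(bool(re.fullmatch(pat, "2eytmpxe8wsdtmdry1u")))   # False, last run too short
```

`re.fullmatch` requires the whole string to fit; to find such a token inside longer text, `re.search` with the same pattern would be the usual choice.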


Patrick Artner
  • @Manu fixed some errors - I had somehow omitted the step of converting `'\t'` to a character range, and I now use a set for the character comparison, which makes it faster. If it works, see: https://stackoverflow.com/help/someone-answers and https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – Patrick Artner Oct 13 '19 at 08:08