32

Consider this (very simplified) example string:

1aw2,5cx7

As you can see, it is two digit/letter/letter/digit values separated by a comma.

Now, I could match this with the following:

>>> from re import match
>>> match(r"\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>

The problem, though, is that I have to write \d\w\w\d twice. With small patterns this isn't so bad, but with more complex regexes, writing the exact same thing twice makes the final pattern enormous and cumbersome to work with. It also seems redundant.

I tried using a named capture group:

>>> from re import match
>>> match(r"(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>

But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.

Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?

6 Answers

26

No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.

You can always do so by re-using Python variables, of course:

digit_letter_letter_digit = r'\d\w\w\d'

then use string formatting to build the larger pattern:

match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)

or, using Python 3.6+ f-strings:

dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)

I often do use this technique to compose larger, more complex patterns from re-usable sub-patterns.
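A minimal sketch of this composition approach, using the sample string from the question. The names DLLD, PAIR and pair_re are illustrative only. Wrapping the sub-pattern in a non-capturing group (?:...) is an extra precaution, not something the approach requires: it keeps the sub-pattern a single unit if a quantifier or alternation is added around it later.

```python
import re

# Reusable sub-pattern: digit, two word characters, digit.
# The non-capturing group keeps it a single unit inside larger patterns.
DLLD = r"(?:\d\w\w\d)"

# Compose the larger pattern from the sub-pattern with an f-string (Python 3.6+).
PAIR = rf"{DLLD},{DLLD}"
pair_re = re.compile(PAIR)

print(pair_re.pattern)                       # the expanded pattern text
print(bool(pair_re.fullmatch("1aw2,5cx7")))  # two different values of the same shape
print(bool(pair_re.fullmatch("1aw2,5cx")))   # second value is incomplete
```

The sub-pattern is written once, and the expanded pattern is still available through pair_re.pattern when debugging.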

If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number, re-uses the pattern of an already defined (implicitly numbered) capturing group:

(\d\w\w\d),(?1)
^........^ ^..^
|           \
|             re-use pattern of capturing group 1  
\
  capturing group 1

You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern of groupname, not the text it matched (the latter two forms are alternatives for compatibility with other engines).

And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:

(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
          |                    \       /          
 creates 'dlld' pattern      uses 'dlld' pattern twice

Just to be explicit: the standard library re module does not support subroutine patterns.
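As a quick check of that last point, the sketch below feeds the subroutine syntaxes shown above to the standard library re module and reports whether each one compiles. The helper name compiles_with_re is hypothetical, and the second candidate uses the (?P<name>...) group spelling so that only the subroutine call is unfamiliar to re. The expectation is that re rejects all three with re.error.

```python
import re

def compiles_with_re(pattern: str) -> bool:
    """Return True if the stdlib re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

candidates = [
    r"(\d\w\w\d),(?1)",                                # numbered subroutine call
    r"(?P<dlld>\d\w\w\d),(?&dlld)",                    # named subroutine call
    r"(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)",  # DEFINE block
]

for pattern in candidates:
    print(pattern, "->", compiles_with_re(pattern))

# Spelling the sub-pattern out twice is accepted by re.
print(compiles_with_re(r"\d\w\w\d,\d\w\w\d"))
```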

Martijn Pieters
  • 1
    @iCodez Is [this other answer](https://stackoverflow.com/a/21560430/3719101) with named groups instead, like `(?'digitletters'\d\w\w\d),(?&digitletters)` not a way to actually "symbolize patterns" and factorize them within the regex? If yes, maybe you could mark it as accepted instead or people will keep thinking there is no way to do so. – iago-lito May 29 '19 at 07:46
  • 1
    @iago-lito: The Python `re` module doesn't support recursive patterns. Only `regex` does. Note that you can't ping the OP in a comment on answers they haven't participated in. – Martijn Pieters May 29 '19 at 12:37
  • Arh, okay. Thank you for both clarifications :) Maybe it's worth at least notifying readers that PCRE supports it? I came to this post while not specifically searching for a python-flavored regex solution. – iago-lito May 29 '19 at 13:00
  • 1
    @iago-lito: I don't quite see the point. This question is about Python and its standard library `re` module, not about regular expression engines in general. There are way too many variations between engines; there is no one standard regular expression syntax. You'd be better off going to a site like https://www.regular-expressions.info/ which specialises in tracking [the various different regex features and what implementations support which ones of those](https://www.regular-expressions.info/refbasic.html). – Martijn Pieters May 29 '19 at 13:21
8

Note: this will work with the PyPI regex module, not with the re module.

You could use the notation (?group-number), in your case:

(\d\w\w\d),(?1)

it is equivalent to:

(\d\w\w\d),(\d\w\w\d)

Be aware that \w also matches digits (everything \d matches) and the underscore. To allow only letters in the middle, the regex would be:

(\d[a-zA-Z]{2}\d),(?1)
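To illustrate the difference, here is a small sketch comparing the two sub-patterns on their own with the standard re module (no subroutine call is involved, so re is sufficient). The sample strings and the names loose and strict are made up for illustration.

```python
import re

loose = re.compile(r"\d\w\w\d")          # \w also accepts digits and underscore
strict = re.compile(r"\d[a-zA-Z]{2}\d")  # only ASCII letters in the middle

# Columns: sample, accepted by loose, accepted by strict.
for sample in ["1aw2", "1234", "1_x2"]:
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
```

Only "1aw2" is accepted by both patterns; "1234" and "1_x2" pass the \w version alone.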
Toto
  • 1
    Too bad :-( it's a PCRE feature, I thought Python recognized it. – Toto Feb 04 '14 at 18:50
  • For named capture groups use `(?&name)`. The alternative forms `(?P>name)` and `(?P&name)` are also supported. `regex` is great! – AXO Dec 26 '17 at 05:58
  • Yes! The `regex` module on PyPI is great! This answer is also great! +1 – iBug Dec 27 '18 at 11:16
0

I was troubled with the same problem and wrote this snippet

import nre
my_regex=nre.from_string('''
a=\d\w\w\d
b={{a}},{{a}}
c=(?P<id>{{a}}),(?P=id)
''')
my_regex["b"].match("1aw2,5cx7")

For lack of a more descriptive name, I named the partial regexes as a,b and c.

Accessing them is as easy as {{a}}

Uri Goren
0
import re
# Compile the sub-pattern once so that it can be reused later.
digit_letter_letter_digit = re.compile(r"\d\w\w\d")
# findall returns each match as a string; finditer would yield match objects instead.
all_finds = digit_letter_letter_digit.findall("1aw2,5cx7")
for value in all_finds:
    print(digit_letter_letter_digit.match(value))
Uddhav P. Gautam
0

Since you're already using re, why not use string processing to manage the pattern repetition as well:

pattern = "P,P".replace("P",r"\d\w\w\d")

re.match(pattern, "1aw2,5cx7")

OR

P = r"\d\w\w\d"

re.match(f"{P},{P}", "1aw2,5cx7")
Alain T.
  • That's rather unreadable. Why not just use string substitutions? – Martijn Pieters May 29 '19 at 13:14
  • @Martijn Pieters, you are right. In fact using re.sub() didn't actually work as I wrote it because regex special characters were being processed instead of simply replacing the source. – Alain T. May 29 '19 at 13:38
-1

Try using a backreference. I believe something like the pattern below would work to match

1aw2,5cx7

You could use

(\d\w\w\d),\1

See here for reference http://www.regular-expressions.info/backref.html

Srb1313711
  • 4
    Thanks for the answer, but this won't actually work in my case. Using `\1` will make it look for two occurrences of `1aw2`. I want two occurrences of `\d\w\w\d`, regardless of the digits/letters. –  Nov 06 '13 at 14:11
  • The [`\1` back reference](https://www.regular-expressions.info/backref.html) matches the *literal text* that the numbered group matches. It doesn't re-use the pattern. – Martijn Pieters May 29 '19 at 13:13
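
A short sketch of that distinction, using the sample string from the question and the standard re module. The variable name backref is illustrative.

```python
import re

# \1 must repeat the exact text captured by group 1, not the group's pattern.
backref = re.compile(r"(\d\w\w\d),\1")

print(bool(backref.fullmatch("1aw2,1aw2")))  # same text repeated
print(bool(backref.fullmatch("1aw2,5cx7")))  # same shape, different text
```

The first call finds a match and the second does not, which is why a backreference cannot stand in for a reusable sub-pattern.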