
String in question:

ipAddressString = "192.192.10.5/24"

I'm trying to match 192.192 in the above string.

a) The code below raises an error, and I don't understand why \1 is not matching the second 192:

>>> print re.search('(\d{1,3})\.\1',ipAddressString).group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

I was expecting the output to be: 192.192

b) When I use the regex below, however, it matches 192.192 as expected. As I understand it, the regex in point a) should have produced the same .group() output:

>>> print re.search('(\d{1,3})\.(\d{1,3})',ipAddressString).group()
192.192
Zizou
  • Try using a raw string `print (re.search(r'(\d{1,3})\.\1',ipAddressString))` https://ideone.com/TSCCkW – The fourth bird Jul 24 '19 at 15:50
  • `\1` in `'(\d{1,3})\.\1'` is a char with octal number 1, a SOH (*Start Of Heading*) char. `r'\1'` is a backreference, a combination of ``\`` and `1` chars. – Wiktor Stribiżew Jul 24 '19 at 15:51
  • Plus one simply for teaching me about the existence of group backreferences. – MonkeyZeus Jul 24 '19 at 15:53
  • Possible duplicate of [Handling backreferences to capturing groups in re.sub replacement pattern](https://stackoverflow.com/questions/8157267/handling-backreferences-to-capturing-groups-in-re-sub-replacement-pattern) – mx0 Jul 24 '19 at 17:05
  • @mx0 - What's the title of that link ? Is it _Backreferences in string Parsing_ ? If its _not_ then it is not a duplicate. –  Jul 24 '19 at 18:05

1 Answer

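Python 3 defines a fixed list of escape sequences for string literals.

Those are the escapes that get interpreted when Python parses a string literal.
Any other backslash sequence, such as \d or \., is left in the string unchanged, backslash included.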

So, if you give it a string like '(\d{1,3})\.\1'
it interpolates the \1 as a character with an octal value of 1.

\ooo Character with octal value ooo

So this is what you get

>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\1',ipAddressString)
>>> print (hh)
None
>>> print ('(\d{1,3})\.\1')
(\d{1,3})\.☺

The regex engine sees (\d{1,3})\. followed by the control character with value 1
(displayed as ☺ by this console). That is a valid pattern, not an error,
but it doesn't match what you want.
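
A quick way to confirm what each literal actually contains is to inspect it directly. This is only a minimal sketch; the variable names `plain` and `raw` are made up for illustration.

```python
# Non-raw literal: Python converts \1 to a single character before re sees it.
plain = '\1'
# Raw literal: the backslash and the digit both survive.
raw = r'\1'

print(len(plain), ord(plain))  # one character, code point 1
print(len(raw), repr(raw))     # two characters: backslash and '1'
```

Only the raw version hands the two characters \ and 1 to the regex engine, which is what a backreference needs.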

Ways around this:

  • Escape the backslash so that \1 reaches the regex engine intact:
    '(\d{1,3})\.\\1'
  • Make the string a raw string,
    with either double quotes r"(\d{1,3})\.\1" or single quotes r'(\d{1,3})\.\1'

Using the first method we get:

>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\\1',ipAddressString)
>>> print (hh)
<re.Match object; span=(0, 7), match='192.192'>
>>> print ('(\d{1,3})\.\\1')
(\d{1,3})\.\1
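
The second method, the raw string, can be sketched the same way. This is an illustrative script rather than a REPL transcript, and the snake_case names are my own, not from the question.

```python
import re

ip_address_string = "192.192.10.5/24"

# Raw string: \d, \. and \1 all reach the regex engine unchanged,
# so \1 is treated as a backreference to group 1.
pattern = r'(\d{1,3})\.\1'
match = re.search(pattern, ip_address_string)

print(match.group())   # whole match
print(match.group(1))  # text captured by group 1
```

With the raw string, group 1 captures the first 192 and \1 requires the same text again after the dot, so the whole match is 192.192.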

As a side note, most regex engines also recognize octal escapes. To distinguish an octal escape from a backreference, an engine usually requires a leading \0 followed by a 2 or 3 digit octal number (\0000-\0377, for example), but some engines don't require it and will accept both forms.

Thus, there is a gray area of overlap.

Some engines will flag an item (for example \2) when they find
an ambiguity and then, once the whole regex has been parsed, revisit it
and treat it as a backreference if the group exists, or as an octal escape
if it doesn't. Perl is famous for this.

In general, each engine handles the issue of octal vs backreference
in its own bizarre way. It's always a gotcha waiting to happen.
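
For Python's own re module, the documented rule is that \number is a group reference unless the first digit is 0 or the number is three octal digits long, in which case it is read as an octal character code. The sketch below only illustrates that rule with raw strings; the variable names are made up for the example.

```python
import re

# Group reference: \1 repeats whatever group 1 captured.
backref = re.fullmatch(r'(a)\1', 'aa')

# Leading zero: \01 is the character with octal value 1, not group 1.
leading_zero = re.fullmatch(r'(a)\01', 'a\x01')

# Three octal digits: \101 is the character with octal value 101, i.e. 'A'.
three_digits = re.fullmatch(r'\101', 'A')

print(backref is not None, leading_zero is not None, three_digits is not None)
```

All three patterns match, which shows that even within a single engine the same backslash-digit syntax can mean two different things.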