Why doesn't \0 work (to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and so do \1, \2, ...?

This simple example (executed in Python 3.7) says it all:

import re

subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'

regexp_obj = re.compile(regexp_pattern)

match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))

The output from this is:

Full match, by method: 123
Full match, by template: 
Capture group 1, by method: 2
Capture group 1, by template: 2

Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?

Is this a Python bug?

Mark Rotteveel
QuestionOverflow
  • `expand_template_full = r'\g<0>'` will work ;) – Olvin Roght Sep 27 '19 at 12:53
  • Think of `\0` as reserved for a substitution for the whole match. It will return the whole string that matched your pattern - even if not in a capture group. Then your actual capture groups start at sub 1 (`\1`). This is the case in many languages (maybe all) - not just python. The reason why `match.group(0)` returns the same as `\1` is because that only looks at your actual capture groups. – dvo Sep 27 '19 at 12:59
  • @dvo, you seem to have mixed up more than one thing about my question? Thanks anyway though! – QuestionOverflow Oct 02 '19 at 22:45
  • The duplicate claim that has been attached to this question is wrong, please remove it! (and the related downvote, please...) That other question is NOT at all about the same thing as this question! The only even remotely close similarity is that someone mentions _something_ about capture group 0 in one of that question's answers, which also isn't even relevant to that question! So, again, please remove the duplicate marking (and downvote) for this question, please! – QuestionOverflow Oct 02 '19 at 22:49

3 Answers

Huh, you're right, that is annoying!

Fortunately, Python's way ahead of you. The docs for sub say this:

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.

So your code example can be:

import re

subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'

regexp_obj = re.compile(regexp_pattern)

match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))

You also asked the far more interesting question of "why?". The docs explain what the \g<number> form is for: it removes an ambiguity, since \g<2>0 means group 2 followed by a literal '0', whereas \20 would be read as a reference to group 20. They don't explain why \0 doesn't work, though. I've not been able to find a PEP explaining the rationale, but here's my guess:

We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.

I can't say I agree with this logic, since \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
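To make the difference concrete, here is a minimal sketch reusing the pattern and subject from the question. It assumes the behaviour described in the comments on this answer, namely that \0 in a template is read as an octal escape, which would explain why the question's output looks empty: the result is an invisible NUL character rather than an empty string. The variable names are only for illustration.

```python
import re

match = re.search(r'\d(2)\d', '123')

# In a replacement template, \0 is parsed as an octal escape,
# so it expands to the NUL character, not to the full match.
nul_result = match.expand(r'\0')

# \g<0> is the template spelling of "the entire match".
full_by_template = match.expand(r'\g<0>')
full_by_method = match.group(0)

print(repr(nul_result))        # '\x00'
print(repr(full_by_template))  # '123'
print(repr(full_by_method))    # '123'
```

Printing with repr() is what makes the NUL character visible; a plain print() shows nothing for it on most terminals.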

ymbirtt
  • `\0` or `&` is often available in other regex implementations to refer to the entire matched portion in the replacement section. In Python, you have to use `\g<0>` because `\0` is treated as an octal escape (I don't know why that is, though). – Sundeep Sep 27 '19 at 13:26
  • @Sundeep, because Python's regex library reads `\0` as the NUL character. – Olvin Roght Sep 27 '19 at 13:28
  • Marked as answer for best details and reasoning, thanks! Too bad they didn't put any info at all about the "full-match group" in the match.expand() docs (and also not anything specifically about \0 in the re.sub() docs either, for that matter, although at least a little more general info there about the "full match group", as per your quote from the docs above). – QuestionOverflow Oct 02 '19 at 22:43

If you look in the docs, you will find the following:

The backreference \g<0> substitutes in the entire substring matched by the RE.

Digging a bit deeper into the docs (dating back to 2003), you will find this tip:

There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.

So follow this recommendation and use \g<0>:

expand_template_full = r'\g<0>'
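
The same template syntax applies to re.sub(), not only to match.expand(). As a small sketch (the sample text and variable names are made up for illustration), this wraps every run of digits in brackets, first with \g<0> and then with a callable replacement, which is another way to reach the full match:

```python
import re

text = 'a1b22c333'

# \g<0> in the replacement string stands for the whole match.
wrapped = re.sub(r'\d+', r'[\g<0>]', text)

# A callable replacement receives the match object,
# so match.group(0) can be used directly.
wrapped_by_function = re.sub(r'\d+', lambda m: '[' + m.group(0) + ']', text)

print(wrapped)              # a[1]b[22]c[333]
print(wrapped_by_function)  # a[1]b[22]c[333]
```

The callable form is handy when the replacement needs logic that a template cannot express.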
Olvin Roght
  • Thanks for the info! Too bad they didn't put it in the match.expand() docs too (not to mention buried the explicit info about \0 in some 15 year old docs...), which is where I looked (since I was using this method, rather than re.sub()). – QuestionOverflow Oct 02 '19 at 22:35

Quoting from https://docs.python.org/3/library/re.html

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
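
The rules in that quote can be checked with a short sketch on the pattern side. The first two checks use the documentation's own example; the octal checks are my own illustration of the "first digit is 0" and "3 octal digits" cases, and the variable names are hypothetical.

```python
import re

backref = re.compile(r'(.+) \1')

# \1 repeats whatever group 1 matched; the space in the pattern is required.
matches_the_the = backref.fullmatch('the the') is not None
matches_thethe = backref.search('thethe') is not None

# A leading zero makes the escape an octal character code: \0 is NUL.
nul_pattern_matches = re.fullmatch(r'\0', '\x00') is not None

# Three octal digits are also a character code: \101 is octal for 'A'.
octal_matches_a = re.fullmatch(r'\101', 'A') is not None

print(matches_the_the, matches_thethe)       # True False
print(nul_pattern_matches, octal_matches_a)  # True True
```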

To summarize:

  • Use \1, \2 up to \99, provided no more digits follow the numbered backreference
  • Use \g<0>, \g<1>, etc. (not limited to 99) to robustly backreference a group
    • as far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion, but wouldn't make sense in the search section
    • if you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns
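
As a sketch of the first two points, the following appends a literal 0 after group 1. The pattern and input are made up for illustration. With \g<1>0 the intent is unambiguous, while \10 is read as a reference to group 10, which does not exist in this pattern, so the re module rejects the template.

```python
import re

pattern = re.compile(r'(\d)x')

# Group 1 followed by a literal '0'.
appended_zero = pattern.sub(r'\g<1>0', '7x')

# \10 means "group 10", and this pattern only has one group,
# so building the replacement fails with re.error.
try:
    pattern.sub(r'\10', '7x')
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = str(exc)

print(appended_zero)    # 70
print(ambiguous_error)
```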
Sundeep
  • Thanks for the info! Too bad they didn't put it in the match.expand() docs at all, and also not specifically about \0 in the re.sub() docs either, but rather only in the middle of some larger essay about the re library, which is not normally where developers need to look for info about using a method... – QuestionOverflow Oct 02 '19 at 22:39