Understanding regexes: r'(.*)\1' vs r'(.*)' vs r'(.)\1*'

Question

In these Python examples, why does:

>>> from re import sub
>>> sub(r'(.)\1*', lambda m: str(m.group()*3)+'-', 'abc')

output

'aaa-bbb-ccc-'

and

>>> sub(r'.*', lambda m: str(m.group()*3)+'-', 'abc')

'abcabcabc--'

and:

>>> sub(r'(.*)\1', lambda m: str(m.group()*3)+'-', 'abc')

'-a-b-c-'

Why don't the first and last patterns match the whole string (and pass it as the group)?

Why does the second one add two '-' at the end?

ShadowRanger · Answer 1 · 2021-02-10T03:07:11.370

For #1, \1 matches what the group actually matched, not what it could have matched, which I suspect is the root of your confusion. The . resolves to a specific character (e.g. a), and \1* then matches zero or more copies of that same character (more a's), not zero or more "any characters". Since no character in 'abc' is repeated, each match covers a single character.
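To see the backreference in action, here is a small sketch that lists the individual matches for #1. The second input string, 'aabccc', is my own example and not from the question; it is only there to show \1* picking up real repeats.

```python
import re

# (.) captures one concrete character; \1* then repeats that same character.
pattern = re.compile(r'(.)\1*')

no_repeats = [m.group() for m in pattern.finditer('abc')]
with_repeats = [m.group() for m in pattern.finditer('aabccc')]

print(no_repeats)    # ['a', 'b', 'c']
print(with_repeats)  # ['aa', 'b', 'ccc']
```

With 'abc' every run has length one, which is why the replacement function is called three times with a single character each.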

In #3, you match only empty strings. No non-empty substring of 'abc' is immediately followed by a copy of itself, so \1 can only succeed when (.*) captured nothing (the empty string). That happens once at the beginning, once at the end, and once between each pair of adjacent characters, which gives four empty matches and therefore four hyphens.
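A quick way to check this for #3 is to print the match spans. The 'abab' input below is an assumed extra example (not from the question) showing what the pattern does when a doubled substring actually exists.

```python
import re

pattern = re.compile(r'(.*)\1')

# No substring of 'abc' is doubled, so only empty matches are possible.
empty_spans = [m.span() for m in pattern.finditer('abc')]
print(empty_spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]

# In 'abab' the group can capture 'ab' and \1 matches the second 'ab'.
doubled = pattern.match('abab')
print(doubled.group(), doubled.group(1))  # abab ab
```

Each of the four zero-width spans is replaced by ''*3 + '-', which is how you end up with '-a-b-c-'.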

Both of the above are pretty straightforward if you know regex syntax. But #2 is the weird one: it matches the whole string (.* consumes the whole thing), then, in an arguably incorrect follow-on, matches the empty string that "follows" it at the end of the input. Each of the two matches is replaced, leading to two hyphens.
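The two matches in #2 can be made visible like this. One caveat: as far as I know, sub only replaces an empty match that sits right after a non-empty one since Python 3.7, so the result shown assumes 3.7 or later.

```python
import re

# .* first consumes 'abc', then matches once more (zero-width) at the end.
matches = [(m.span(), m.group()) for m in re.finditer(r'.*', 'abc')]
print(matches)  # [((0, 3), 'abc'), ((3, 3), '')]

# Assumes Python 3.7+: both matches are replaced, hence two hyphens.
result = re.sub(r'.*', lambda m: m.group() * 3 + '-', 'abc')
print(result)  # abcabcabc--
```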

Really, the short answer to your entire question is "* is an unsafe/non-intuitive quantifier to start with, and applying it to . makes it worse". Try to use + wherever possible, as it avoids these weird "match nothing" cases.
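As a rough illustration of that advice, here are the + versions of #2 and #3 run against the same input. The helper name triple is mine, standing in for the lambda from the question.

```python
import re

def triple(m):
    # Same replacement as the lambda in the question.
    return m.group() * 3 + '-'

# .+ needs at least one character, so there is no trailing empty match.
plus_all = re.sub(r'.+', triple, 'abc')
print(plus_all)  # abcabcabc-

# (.+)\1 needs a non-empty doubled substring; 'abc' has none, so nothing changes.
plus_doubled = re.sub(r'(.+)\1', triple, 'abc')
print(plus_doubled)  # abc
```

Pattern #1, (.)\1*, already requires one character before the starred part, which is why it never produced empty matches in the first place.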
