
In the following:

>>> r'\d+','\d+', '\\d+'
('\\d+', '\\d+', '\\d+')

Why does the backslash in '\d+' not need to be escaped? Why does this give the same result as the other two literals?

Similarly:

>>> r'[a-z]+\1', '[a-z]+\1'
('[a-z]+\\1', '[a-z]+\x01')

Why does the \1 get converted into a hex escape?

Karl Knechtel
David542
  • Related [answer](https://stackoverflow.com/a/24085681/) – metatoaster Oct 01 '22 at 02:54
    It took me a while to understand the question for the first part; I edited to make it more clear. Sorry about the temporary hammering. I'm still mystified by the second part, though. "Why does the \1 get converted into a hex escape?" - **what do you think should happen instead? Why?** And how is this supposed to be related to the first example? – Karl Knechtel Oct 01 '22 at 17:36

2 Answers

6

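The "String and Bytes literals" section of the Python language reference has a table of the backslash combinations that are actual escape sequences with a special meaning. Combinations outside that table are not escapes: the backslash stays in the string and both characters are treated as regular characters, whether or not the literal is raw. So "\d" is two characters, just as r"\d" is. By contrast, "\n" is a recognized escape (a single newline character), so it behaves differently from "\d". Note that recent Python versions emit a warning for unrecognized escapes such as "\d" in a non-raw literal, so the raw or doubled form is preferable in new code.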

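\1 is an \ooo octal escape (one to three octal digits), so '\1' is the single character with code 1. When the string is displayed, Python shows that same character as the hex escape \x01. Interestingly, \8 isn't octal (8 is not an octal digit), but instead of raising an error, Python just treats it as two characters, because it's not an escape.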
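A quick way to check these claims is to compare the parsed strings directly. The sketch below is only an illustration: `parse_literal` is a hypothetical helper (not part of any library) that evaluates the source text of a literal with `ast.literal_eval` while silencing the invalid-escape warning, so the example behaves the same on older and newer Python 3 versions.

```python
import ast
import warnings


def parse_literal(source):
    """Hypothetical helper: evaluate the source text of a string literal,
    hiding the warning Python may emit for unrecognized escapes."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


plain = parse_literal(r"'\d+'")      # source text: '\d+'
raw = parse_literal(r"r'\d+'")       # source text: r'\d+'
doubled = parse_literal(r"'\\d+'")   # source text: '\\d+'

# \d is not an escape, so all three spellings give backslash, d, plus.
print(plain == raw == doubled, len(plain))

newline = parse_literal(r"'\n+'")    # \n IS an escape: newline, plus
octal = parse_literal(r"'\1'")       # \1 is an octal escape: one character
not_octal = parse_literal(r"'\8'")   # 8 is not an octal digit: two characters

print(len(newline), len(octal), len(not_octal))
```

On Python 3 the lengths should come out as 3 for the `\d+` variants, 2 for `'\n+'`, 1 for `'\1'` and 2 for `'\8'`.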

tdelaney
4

Because \d is not an escape code. So, however you type it, it is interpreted as a literal \ then a d. If you type \\d, then the \\ is interpreted as an escaped \, followed by a d.

The situation is different if you choose a letter that is part of an escape code.

>>> r'\n+','\n+', '\\n+'
('\\n+', '\n+', '\\n+')

The first one (because it is raw) and the last one (because the \ is escaped) are 3-character strings containing a \, an n and a +. The second one is a 2-character string containing a '\n' (a newline) and a +.

The second example is even more straightforward. Nothing strange here. r'\1' is a backslash followed by a 1. '\1' is the character whose ASCII code is 1, whose canonical representation is '\x01'. '\1', '\x01' and '\001' are all the same thing. Python cannot remember which specific syntax you used to type it. All it knows is that it is the character with code 1, so it displays it in the "canonical way".

In exactly the same way, 'A', '\x41' and '\101' are the same thing, and all of them are printed with the canonical representation, which is 'A'.
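As a small sketch of that point (the variable names are just for illustration), the different spellings compare equal and share one `repr`, because only the resulting character is stored, not the way it was typed:

```python
# Three spellings of the character with code 1: octal, hex, padded octal.
code_one = ['\1', '\x01', '\001']

# Three spellings of the character with code 65.
letter_a = ['A', '\x41', '\101']

# Every spelling in a group is the same one-character string.
print(all(s == chr(1) for s in code_one))
print(all(s == chr(65) for s in letter_a))

# repr() picks a single canonical form per character: a hex escape for the
# unprintable one, the plain letter for the printable one.
print({repr(s) for s in code_one})
print({repr(s) for s in letter_a})
```

Each set printed at the end has a single element, which is the canonical form the interactive prompt shows.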

chrslg