How is python regex '\\\\' evaluated?

Question

I'm reading the Python documentation for the re library and am quite confused by the following paragraph:

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

How is \\\\ evaluated?

\\\\ -> \\\ -> \\ cascadingly

or \\\\ -> \\ in pairs?

I know \ is a meta character just like |, I can do

>>> re.split('\|', 'a|b|c|d') # split by literal '|'
['a', 'b', 'c', 'd']

but

>>> re.split('\\', 'a\b\c\d') # split by literal '\'
Traceback (most recent call last):

gives me an error. It seems that, unlike \|, the \\ is evaluated more than once.

and I tried

>>> re.split('\\\\', 'a\b\c\d')
['a\x08', 'c', 'd']

which makes me even more confused...

Thanks Ulysse, but since \\\\ evaluates to \\, which ultimately evaluates to a literal \, why can't we use \\ as the regex in the first place? – mzoz, Jul 26 '18 at 14:36
` \\\\ ` evaluates to ` \\ ` and that is it. There is no re-evaluation of your string – Ulysse BN, Jul 26 '18 at 14:37

Eric · Accepted Answer · 2018-07-26T15:53:31.473

4

There are two things going on here - how strings are evaluated, and how regexes are evaluated.

'a\b\c\d' in Python code represents the string a<backspace>\c\d.
'\\\\' in Python code represents the string \\.
The string \\ is a regex pattern that matches the character \.

Your problem here is that the string you're searching is not what you expect.

\b is the backspace character, \x08. \c and \d are not valid escape sequences, so Python leaves the backslash in place. Since Python 3.6 such unrecognized escapes raise a DeprecationWarning (a SyntaxWarning since 3.12), and they are expected to become an error in a future version.

I assume you meant to spell it r'a\b\c\d' or 'a\\b\\c\\d'
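
To make the two layers concrete, here is a minimal sketch (the variable names are just for illustration). It spells out the string that the literal 'a\b\c\d' actually produces using only valid escapes, so it runs without warnings, and compares it with the raw-string version:

```python
import re

# String-literal layer: '\\\\' is a two-character string (two backslashes).
pattern = '\\\\'
pattern_length = len(pattern)  # 2

# What the literal 'a\b\c\d' turns into: \b becomes backspace (\x08),
# while the unrecognized \c and \d keep their backslashes.
actual = 'a\x08\\c\\d'
actual_parts = re.split(pattern, actual)
print(actual_parts)    # ['a\x08', 'c', 'd']

# What was probably intended: a raw string keeps every backslash.
intended = r'a\b\c\d'
intended_parts = re.split(pattern, intended)
print(intended_parts)  # ['a', 'b', 'c', 'd']
```

The first result matches the surprising output in the question: the pattern was fine all along, and only the searched string was different from what it looked like.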

edited Jul 26 '18 at 15:53

answered Jul 26 '18 at 14:37

Eric

1

`\a` would be the bell, `\b` should be backspace. – sandbo00 Jul 26 '18 at 14:45
Good catch, fixed – Eric Jul 26 '18 at 15:53

Shan · Answer 2 · 2018-07-26T14:40:08.613

2

re.split('\\', 'a\b\c\d') # split by literal '\'

You forgot that '\' in the second argument is an escape character too. It would work if the second argument was changed:

re.split(r'\\', 'a\\b\\c\\d')

This r at the start means "raw" string - escape characters are not evaluated.
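
As a quick check of that claim, this small sketch (names are illustrative) compares raw literals with their escaped equivalents and then runs the split:

```python
import re

# A raw literal and a doubly escaped literal produce the same string.
raw_pattern = r'\\'
escaped_pattern = '\\\\'
same_pattern = raw_pattern == escaped_pattern  # True

raw_text = r'a\b\c\d'
escaped_text = 'a\\b\\c\\d'
same_text = raw_text == escaped_text  # True

# The regex engine sees \\ and matches one literal backslash.
parts = re.split(raw_pattern, raw_text)
print(parts)  # ['a', 'b', 'c', 'd']
```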

edited Jul 26 '18 at 14:40

answered Jul 26 '18 at 14:37

Shan

That is more likely to be a comment than an answer – Ulysse BN Jul 26 '18 at 14:38
`r'\'` is a syntax error - raw strings cannot end with an unpaired backslash – Eric Jul 26 '18 at 14:39
Yep, forgot about this one, just edited it. – Shan Jul 26 '18 at 14:40
I'd forgotten that `r'\\'` _is_ legal – Eric Jul 26 '18 at 14:41

score 1 · Answer 3 · answered Jul 26 '18 at 14:43

Think about the implications of evaluating backslashes cascadingly:

If you wanted the string \n (not the newline symbol, but literally \n), you couldn't find a sequence of characters to get said string.

\n would be the newline symbol, \\n would be evaluated to \n, which in turn would become the newline symbol again. This is why escape sequences are evaluated in pairs, in a single pass.
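
A short sketch of the single-pass behaviour, using illustrative variable names:

```python
newline = '\n'    # one character: the newline
literal = '\\n'   # two characters: a backslash followed by the letter n

newline_length = len(newline)  # 1
literal_length = len(literal)  # 2

# The result of '\\n' is not evaluated a second time into a newline.
first_char = literal[0]
second_char = literal[1]
contains_newline = '\n' in literal  # False
```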

So you need to write \\ within a string literal to get a single \, but you need two backslashes in your string so that the regex will match the literal \. Therefore you need to write \\\\ to match a literal backslash.
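
The sketch below walks through both layers, assuming nothing beyond the standard re module. It shows that a pattern holding a single backslash is rejected by the regex compiler, while the two-backslash pattern matches one literal backslash:

```python
import re

one_backslash = '\\'       # string-literal layer: one character
two_backslashes = '\\\\'   # string-literal layer: two characters
lengths = (len(one_backslash), len(two_backslashes))  # (1, 2)

# Regex layer: a lone backslash is an incomplete escape, so it fails to compile.
try:
    re.compile(one_backslash)
    compiled_ok = True
except re.error:
    compiled_ok = False

# Regex layer: two backslashes match exactly one literal backslash.
match = re.fullmatch(two_backslashes, one_backslash)
matched = match is not None  # True

# re.escape builds the regex for a literal string without counting by hand.
escaped = re.escape(one_backslash)
escape_agrees = escaped == two_backslashes  # True
```

This is also why re.split('\\', ...) in the question raises an error: the pattern that reaches the regex engine is a single, dangling backslash.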

You have a similar problem with your a\b\c\d string. The parser will try to evaluate the escape sequences, and \b is a valid sequence for 'backspace', represented as \x08. You will need to escape your backslashes here, too, like a\\b\\c\\d.
