0

I'm reading the Python documentation for the re library and am quite confused by the following paragraph:

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

How is \\\\ evaluated?

\\\\ -> \\\ -> \\ cascadingly

or \\\\ -> \\ in pairs?

I know \ is a metacharacter just like |, so I can do

>>> re.split('\|', 'a|b|c|d') # split by literal '|'
['a', 'b', 'c', 'd']

but

>>> re.split('\\', 'a\b\c\d') # split by literal '\'
Traceback (most recent call last):

gives me an error. It seems that, unlike \|, the \\ is evaluated more than once.

and I tried

>>> re.split('\\\\', 'a\b\c\d')
['a\x08', 'c', 'd']

which makes me even more confused...

mzoz

3 Answers

4

There are two things going on here - how strings are evaluated, and how regexes are evaluated.

  • 'a\b\c\d' in Python code represents the string a<backspace>\c\d
  • '\\\\' in python code represents the string \\.
  • the string \\ is a regex pattern that matches the character \
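The two layers above can be checked separately. This is a minimal sketch; the variable name pattern is only illustrative.

```python
import re

pattern = '\\\\'                    # four backslashes in the source code

# Layer 1: the string literal. Each \\ becomes one backslash.
print(len(pattern))                 # 2
print(pattern == r'\\')             # True: same string, written as a raw literal

# Layer 2: the regex engine. The two-character pattern \\ matches one backslash.
matches = re.findall(pattern, 'a\\b')   # the subject string is a\b (3 characters)
print(matches)                      # ['\\'], i.e. a single backslash
```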

Your problem here is that the string you're searching is not what you expect.

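\b is the backspace character, \x08. \c and \d are not recognized escape sequences, so Python leaves their backslashes in the string. Such invalid escapes have produced a DeprecationWarning since Python 3.6 and a SyntaxWarning since 3.12, and are expected to become an error in a future version.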

I assume you meant to spell it r'a\b\c\d' or 'a\\b\\c\\d'
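A small sketch of the difference, assuming the string from the question. The names actual and intended are only illustrative, and actual is written with valid escapes so that it runs without warnings.

```python
import re

# What the literal 'a\b\c\d' evaluates to: \b becomes backspace (\x08),
# while \c and \d keep their backslashes.
actual = 'a\x08\\c\\d'

# What was probably meant: a, \, b, \, c, \, d
intended = r'a\b\c\d'

print(intended == 'a\\b\\c\\d')              # True: both spellings are the same string

actual_parts = re.split(r'\\', actual)
intended_parts = re.split(r'\\', intended)
print(actual_parts)                          # ['a\x08', 'c', 'd']
print(intended_parts)                        # ['a', 'b', 'c', 'd']
```

The first result is the output shown in the question: there is no backslash between a and the backspace character, so no split happens there.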

Eric
2
re.split('\\', 'a\b\c\d') # split by literal '\'

You forgot that '\' is also the escape character in the second argument, the string being split. It works if the backslashes there are escaped and the pattern is written as a raw string:

re.split(r'\\', 'a\\b\\c\\d')

The r at the start marks a "raw" string literal, in which backslash escape sequences are not processed.
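To see why the original pattern fails while the raw one works, here is a rough sketch. The names bad_pattern, good_pattern and compiled are only illustrative.

```python
import re

bad_pattern = '\\'       # one backslash once the string literal is processed
good_pattern = r'\\'     # two backslashes: the regex for one literal backslash

try:
    re.compile(bad_pattern)
    compiled = True
except re.error:
    # A lone backslash is an unfinished escape as far as the regex engine is concerned.
    compiled = False

print(compiled)                                  # False
parts = re.split(good_pattern, 'a\\b\\c\\d')
print(parts)                                     # ['a', 'b', 'c', 'd']
```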

Shan
1

Think about the implications of evaluating backslashes cascadingly:

If you wanted the string \n (not the newline symbol, but literally \n), you couldn't find a sequence of characters to get said string.

\n would be the newline symbol, and \\n would be evaluated to \n, which in turn would become the newline symbol again. This is why escape sequences are evaluated in pairs.

So you need to write \\ within a string literal to get a single \, but you need two backslashes in the string itself so that the regex matches a literal \. Therefore you need to write \\\\ to match a literal backslash.
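A quick check that escapes are consumed in pairs, in one pass, rather than re-evaluated. The variable names are only illustrative.

```python
literal_backslash_n = '\\n'     # two backslashes and an n in the source code

# One pass only: \\ becomes a single backslash, and the result is not re-read as \n.
chars = list(literal_backslash_n)
print(chars)                            # ['\\', 'n']: a backslash, then n
print(literal_backslash_n == '\n')      # False: it is not the newline character

four = '\\\\'                   # four backslashes in the source code
print(len(four))                        # 2, not 1: two pairs, each giving one backslash
```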

You have a similar problem with your a\b\c\d string. The parser will try to evaluate the escape sequences, and \b is a valid sequence for 'backspace', represented as \x08. You will need to escape your backslashes here, too, like a\\b\\c\\d.

sandbo00