I am just getting started with Python 3 and am currently lost in regular expressions. To understand raw strings, I wrote this code:

import re
pattern='\\n'
c='qwerty\. i am hungry. \n z'
d=r'qwerty\. i am hungry. \n z '
print(c)
print(d+'\n\n\n')
for a in (c,d,):
    if re.search(pattern,a):
        print('"{0}" contains "{1}" \n'.format(a, pattern))
    else:
        print('"{0}" does not contain "{1}" \n'.format(a, pattern))

The output shows that the first string contains the pattern and the second does not. However, once I make a minor change to the pattern:

import re
pattern=r'\\n'
c='qwerty\. i am hungry. \n z'
d=r'qwerty\. i am hungry. \n z '
print(c)
print(d+'\n\n\n')
for a in (c,d,):
    if re.search(pattern,a):
        print('"{0}" contains "{1}" \n'.format(a, pattern))
    else:
        print('"{0}" does not contain "{1}" \n'.format(a, pattern))

The result is reversed: the second string now contains r'\\n', which I cannot understand, since there is no double backslash in it. Could you please explain this mystery to me?

Serhii Orlyk

2 Answers

A raw string tells Python to read the backslashes in the literal as what they are: backslashes. So,

print(r'hi\nhi')

prints hi\nhi.
In a non-raw string, however, Python treats a backslash as the start of an escape sequence that changes the meaning of the following character. Hence,

print('hi\nhi')

Prints:

hi
hi

So, the \n in non-raw strings becomes a newline.
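To make the difference visible, here is a small sketch (the variable names plain and raw are mine, not from the question) that compares the two literals by length and repr:

```python
plain = 'hi\nhi'   # escape processed: h, i, newline, h, i
raw = r'hi\nhi'    # backslash kept:   h, i, backslash, n, h, i

print(len(plain))        # 5
print(len(raw))          # 6
print(repr(plain))       # 'hi\nhi'
print(repr(raw))         # 'hi\\nhi'

# A raw literal is only a different way of writing the same string:
print(raw == 'hi\\nhi')  # True
```

The r prefix only affects how the literal is read in the source code; the resulting object is an ordinary str.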


In your first snippet, pattern = '\\n' is a two-character string: a backslash followed by n (exactly the same string as r'\n'). It does not contain a newline itself, but the regex engine interprets the sequence \n as a newline, so the search looks for a newline character.

In your second snippet, pattern = r'\\n' holds two backslashes followed by n. The regex engine reads \\ as an escaped (literal) backslash, so it searches for a backslash followed by n.
Now look at the strings being searched: c contains a newline, and d literally contains \n (a backslash and n). You can verify this by printing them.

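  • When pattern = '\\n', the regex engine searches for a newline. So c matches, but d does not.

  • When pattern = '\n', the pattern is itself a newline character, which matches literally. So c matches, but d does not.

  • When pattern = r'\\n', the regex engine searches for a backslash followed by n. So d matches, but c does not.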
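The three cases above can be checked with a short sketch. Note that I shortened c and d and dropped the \. from the question's strings to keep the focus on \n; the cases list and results dict are just illustrative names:

```python
import re

c = 'i am hungry. \n z'    # contains a real newline character
d = r'i am hungry. \n z'   # contains a backslash followed by n

# (label as written in source code, actual pattern string)
cases = [
    (r"'\\n'", '\\n'),     # backslash + n   -> regex matches a newline
    (r"'\n'", '\n'),       # newline char    -> regex matches a newline
    (r"r'\\n'", r'\\n'),   # two backslashes + n -> regex matches backslash + n
]

results = {}
for label, pattern in cases:
    results[label] = (bool(re.search(pattern, c)), bool(re.search(pattern, d)))
    print(label, 'c matches:', results[label][0], 'd matches:', results[label][1])
```

The first two patterns should report a match for c only, and the third for d only.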

Robo Mop

According to Python's escape-processing rules for string literals:

'\n' becomes the newline character (ASCII 0x0A); this is the case for your first string to match, c.
r'\n' becomes the literal \n, that is \ followed by n; this is the case for your second string to match, d.

'\\n' becomes the literal \n; this is your first pattern string.
r'\\n' becomes the literal \\n; this is your second pattern string.

When you perform the matching, re.search applies a second round of escape processing to the pattern, this time using the regex engine's own rules:
the literal \n is read as the newline character, ASCII 0x0A (first pattern)
the literal \\n is read as the literal \n (second pattern)

So in the end your first string matches the first pattern, as both contain the newline character,
and the second string matches the second pattern, as both contain the literal \n.
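The two rounds can be observed separately. In this sketch (first and second are my names for the two pattern strings), list() shows what is left after Python has processed the literal, and re.fullmatch shows what the regex engine makes of it:

```python
import re

first = '\\n'     # after Python's escape processing: backslash, n
second = r'\\n'   # raw literal, unchanged: backslash, backslash, n

# Round 1: what the pattern strings actually contain.
print(list(first))    # ['\\', 'n']
print(list(second))   # ['\\', '\\', 'n']

# Round 2: what the regex engine matches with each pattern.
print(re.fullmatch(first, '\n') is not None)     # True: a single newline
print(re.fullmatch(second, '\\n') is not None)   # True: backslash followed by n
print(re.fullmatch(second, '\n') is not None)    # False: no newline involved
```

In the printed lists each '\\' is the repr of a single backslash character.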

That's it, no mystery here.

wolfrevokcats