
I have lots of text files full of newlines, which I am parsing in Python 3.4. I am looking for the newlines because they separate my text into different parts. Here is an example of such a text:

text = 'avocat  ;\n\n       m. x'

I naïvely started looking for newlines with '\n' in my regular expression (RE), without thinking that the backslash '\' is an escape character. However, this turned out to work fine:

>>> import re

>>> pattern1 = '\n\n'
>>> re.findall(pattern1, text)
['\n\n']

Then I understood I should be using a double backslash in order to look for one backslash. This also worked fine:

>>> pattern2 = '\\n\\n'
>>> re.findall(pattern2, text)
['\n\n']

On another thread, however, I was told to use raw strings instead of regular strings, yet this form fails to find the newlines I am looking for:

>>> pattern3 = r'\\n\\n'
>>> pattern3
'\\\\n\\\\n'
>>> re.findall(pattern3, text)
[]

Could you please help me out here? I am getting a little confused about what kind of RE I should be using in order to correctly match the newlines.

Tanguy

2 Answers


Don't double the backslash when using a raw string:

>>> pattern3 = r'\n\n'
>>> pattern3
'\\n\\n'
>>> re.findall(pattern3, text)
['\n\n']
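
To see why this works, it can help to compare the characters each literal actually contains. The following is only a small sketch; the variable names `regular`, `escaped` and `raw` are made up for illustration and are not from the question:

```python
import re

text = 'avocat  ;\n\n       m. x'

regular = '\n\n'     # two actual newline characters
escaped = '\\n\\n'   # four characters: backslash, n, backslash, n
raw = r'\n\n'        # the same four characters as `escaped`

print(len(regular), len(escaped), len(raw))
print(raw == escaped)

# All three patterns find the pair of newlines: the first passes real
# newlines to the regex engine, the other two pass the regex escape \n.
for pattern in (regular, escaped, raw):
    print(re.findall(pattern, text))
```

So the raw string is just a more readable way of writing the doubled-backslash version; doubling the backslashes inside a raw string doubles them a second time.
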
Assem

OK I got it. In this nice Python regex cheat sheet it says: "Special character escapes are much like those already escaped in Python string literals. Hence regex '\n' is same as regex '\\n'".

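This is why pattern1 and pattern2 both matched my text in my previous example: pattern1 hands the regex engine two actual newline characters, and pattern2 hands it the regex escape \n twice, which the engine itself interprets as newlines. However, pattern3 is r'\\n\\n', whose canonical string representation is '\\\\n\\\\n'; as a regex it looks for a literal backslash followed by the letter n, twice, and my text contains no such sequence.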
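
A quick way to check this is to run pattern3 against a text that really does contain backslashes. This is just a sketch; `newline_text` and `backslash_text` are hypothetical names, and the second string is a made-up variant of my example written as a raw string:

```python
import re

pattern3 = r'\\n\\n'   # regex: literal backslash, 'n', literal backslash, 'n'

newline_text = 'avocat  ;\n\n       m. x'      # contains two real newlines
backslash_text = r'avocat  ;\n\n       m. x'   # contains backslash + 'n', twice

no_match = re.findall(pattern3, newline_text)
literal_match = re.findall(pattern3, backslash_text)

print(no_match)        # nothing found: there is no backslash in this text
print(literal_match)   # one match made of the four literal characters
```
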

Tanguy