55

I'm a little confused about Python raw strings. I know that in a raw string, '\' is treated as an ordinary backslash (e.g. r'\n' is the two characters \ and n). However, I was wondering how to match a newline character when the pattern is a raw string. I tried r'\\n', but it didn't work.

Does anybody have a good idea about this?

martineau
wei
  • What kind of match are we talking about here? Are you talking about a regular expression match, or simply a `if ... in my_raw_string`? – mgilson Feb 04 '13 at 15:07
  • Sorry to confuse you. I'm talking about a regular expression. – wei Feb 04 '13 at 15:12
  • A raw string (really a raw string **literal**) is **not a different kind of string**; it is **only** a different way to *describe* the string in source code. The problem is simply "how do I match a newline character in a string using regex?"; that has **nothing to do with** raw strings. – Karl Knechtel Jan 21 '23 at 13:48

5 Answers

54

In a regular expression, \n matches a newline even when the pattern is written as a raw string. Here the match is done in multiline mode:

>>> import re
>>> s = """cat
... dog"""
>>> 
>>> re.match(r'cat\ndog',s,re.M)
<_sre.SRE_Match object at 0xcb7c8>

Notice that re itself translates the \n in the raw string (a backslash followed by n) into a newline. As you indicated in your comments, you don't actually need re.M for it to match, but it does help with matching $ and ^ more intuitively:

>>> re.match(r'^cat\ndog',s).group(0)
'cat\ndog'
>>> re.match(r'^cat$\ndog',s).group(0)  #doesn't match
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> re.match(r'^cat$\ndog',s,re.M).group(0) #matches.
'cat\ndog'
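
To tie this back to the original question, here is a minimal sketch of what each spelling of the pattern ends up matching. The variable names are only illustrative; the point is that r'\n' and '\n' both match a newline, while r'\\n' asks for a literal backslash followed by n.

```python
import re

s = "cat\ndog"  # contains one real newline character

# Raw pattern: re receives a backslash + 'n' and interprets it as a newline.
raw_match = re.search(r'cat\ndog', s)

# Non-raw pattern: Python already turned \n into a newline before re saw it.
plain_match = re.search('cat\ndog', s)

# r'\\n' means "a literal backslash followed by n", which s does not contain.
escaped_match = re.search(r'cat\\ndog', s)

# The same pattern does match text that really contains a backslash and an 'n'.
literal_match = re.search(r'cat\\ndog', r'cat\ndog')

print(raw_match is not None, plain_match is not None,
      escaped_match is not None, literal_match is not None)
```

This is probably why r'\\n' "didn't work" in the question: the doubled backslash is only needed when the target text itself contains a backslash.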
Alan W. Smith
mgilson
  • Thanks for your answer @mgilson! I'd also like to know why we need to specify multiline mode. I tried matching without it, like this: `re.match(r'cat\ndog', s)`, and it still works. – wei Feb 04 '13 at 15:33
  • @user1783403 -- You're correct. I should read the documentation more. Specifying `re.M` gets `^` and `$` to match more intuitively. – mgilson Feb 04 '13 at 15:36
  • Any way to get `$` to match "less intuitively" - i.e. to match *only* at the end of the string? I don't want it to match before `\n` – Aaron McDaid Oct 07 '15 at 13:52
  • Use re.DOTALL option to match `\n`. – CKM Mar 20 '17 at 06:43
15

The simplest answer is to not use a raw string. In a normal string, \n is a newline, and you can escape backslashes by using \\.

If you have huge numbers of backslashes in some segments, then you could concatenate raw strings and normal strings as needed:

r"some string \ with \ backslashes" "\n"

(Python automatically concatenates string literals with only whitespace between them.)
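
As a small sketch of that concatenation (the pattern and sample text below are made up for illustration), a raw literal can carry the regex escapes while a normal literal supplies the newline:

```python
import re

# The raw part keeps its backslashes; the normal part contributes a real newline.
pattern = r"\d+\.\d+" "\n"

text = "3.14\n2.71\n"
found = re.findall(pattern, text)

print(repr(pattern))
print(found)
```

Both literals are joined into a single string before re ever sees the pattern, so the result behaves like one ordinary pattern.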

Remember if you are working with paths on Windows, the easiest option is to just use forward slashes - it will still work fine.

Gareth Latty
  • @mgilson I was just checking it worked with raw strings and normal strings, as it's not something I had done. Edited as it does. It's actually a little better as I believe the concatenation is done at parse time, rather than when it's executed. – Gareth Latty Feb 04 '13 at 15:09
  • Yeah, I'd never actually checked before now either :) – mgilson Feb 04 '13 at 15:10
  • I'm not sure about the downvote either. I suppose somebody might have seen the comment about OP wanting this in the context of a regex and decided that this didn't apply. Anyway, FWIW, I upvoted because I liked the automagic concatenation of strings (raw and normal) – mgilson Feb 04 '13 at 17:57
  • Yeah, with that information, the post is a little out, but that wasn't there when I posted. Ah well, unexplained downvotes happen. – Gareth Latty Feb 04 '13 at 17:59
2

You can also use [\r\n] to match a newline character (either a carriage return or a line feed).
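
A quick sketch of how such a character class behaves on mixed line endings (the sample text is invented for illustration). Note that the class matches one character at a time, so a Windows \r\n counts as two matches; the alternation shown second is one possible way to treat it as a single line ending.

```python
import re

text = "unix\nwindows\r\nold mac\rend"

# [\r\n] matches a single CR or LF, so "\r\n" yields two separate matches.
single = re.findall(r'[\r\n]', text)

# Trying \r\n first treats a Windows line ending as one unit.
endings = re.findall(r'\r\n|\r|\n', text)

# The same alternation can be used to split the text into lines.
lines = re.split(r'\r\n|\r|\n', text)

print(single, endings, lines)
```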

2

Ten years on, this came up in my search for newline not matching.

There are a couple of options, each targeting a different angle.
This answer revolves around the applicable regex flags.

  1. Raw string at Source
  2. regex match and flag
  3. regex flag: re.DOTALL

  1. Raw string at source:
    If it has no impact on the meaning or context of the text, the newline, \n, can be replaced or stripped off. For instance, one can use str.replace(), str.strip(), re.sub(), or any other preferred option.

  2. regex match and flag
    The regex can match in place, with the pattern written around the newline itself, or out of place, matching anywhere within the text irrespective of the newline, \n.

  3. regex flag: re.DOTALL, re.MULTILINE | re.M
    The Python manual documents the re.DOTALL flag as extending the . pattern to also match a newline. In my view, re.DOTALL handles the newline \n much better. It appears to allow multiple strings to be matched 'easily'. See the sample code.

re.DOTALL [https://docs.python.org/3/library/re.html#re.DOTALL]
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).

. (Dot.)
In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
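
A minimal sketch of the dot behaviour quoted above, assuming a two-line sample string; it also shows the inline (?s) form, which the documentation describes as corresponding to re.DOTALL.

```python
import re

s = "cat\ndog"

# Default mode: '.' refuses to match the newline between the two words.
no_flag = re.match(r'cat.dog', s)

# With re.DOTALL the same '.' is allowed to match the newline.
with_flag = re.match(r'cat.dog', s, re.DOTALL)

# The inline flag (?s) switches DOTALL on from inside the pattern.
inline = re.match(r'(?s)cat.dog', s)

print(no_flag, with_flag, inline)
```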

^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. Kindly note that in MULTILINE mode, match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line. See https://docs.python.org/3/library/re.html#search-vs-match
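
The anchor rules quoted above can be checked with a short sketch (the sample strings are invented). It also uses \Z, which matches only at the very end of the string, as one way to avoid $ matching before a trailing newline.

```python
import re

s = "cat\ndog\nbird"

# match() is anchored at the start of the string, even in MULTILINE mode.
m = re.match(r'^dog', s, re.M)

# search() may begin at the start of any line in MULTILINE mode.
srch = re.search(r'^dog', s, re.M)

# Without re.M, ^ only matches at the very start of the string.
no_flag = re.search(r'^dog', s)

# In MULTILINE mode, ^ and $ delimit every line.
line_words = re.findall(r'^\w+$', s, re.M)

# $ also matches just before a trailing newline; \Z does not.
dollar = re.search(r'dog$', "cat\ndog\n")
strict_end = re.search(r'dog\Z', "cat\ndog\n")

print(m, srch, no_flag, line_words, dollar, strict_end)
```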

Below is a Python walkthrough of re.DOTALL vis-a-vis re.M

import re

## sample strings (neither is a raw string literal; both contain real newlines)
s1_str = '''"type": "Car",
  "brand": "Ford",
  "model": "Fiesta",
  "colour": "Black'''

s2_str = '"Type": "Car",\n"brand": "Ford", \n"model": "Fiesta", \n"colour": "Black'

##regex match 
m1_1 = re.match(r'(?=.*?[cC])(?=.*?Fiesta).*', s1_str, re.DOTALL)  #matches!!
m1_2 = re.match(r'(?=^.*?[cC]).*?$', s1_str, re.M)    #matches!!
m2_1 = re.match(r'(?=.*[cC]ar)(?=.*Fiesta).*', s2_str, re.DOTALL)  #matches
m2_2 = re.match(r'(?=.*[cC]ar)(?=.*Fiesta).*', s2_str, re.M)  #doesn't match
m2_3 = re.match(r'(?=.*[cC]ar)(?=.*brand).*', s2_str, re.DOTALL)  #matches
m2_4 = re.match(r'(?=.*[cC]ar)(?=.*brand).*', s2_str, re.M)  #doesn't match
m2_5 = re.match(r'.*Car",\n"brand":', s2_str, re.DOTALL)  #matches
m2_6 = re.match(r'.*Car",$\n"brand', s2_str) #doesn't match
m2_7 = re.match(r'.*Car",$\n"brand', s2_str, re.M)  #matches
m2_8 = re.match(r'.*Car",$\n"brand', s2_str, re.DOTALL)  #doesn't match
m2_9 = re.match(r'.*Car.*\n"brand', s2_str, re.DOTALL)  #matches  #.group

match_list = [m1_1, m1_2, m2_1, m2_2, m2_3, m2_4, m2_5, m2_6, m2_7, m2_8, m2_9]
print('matches: \nm1_1| {0} \nm1_2| {1} \nm2_1| {2} \nm2_2| {3} \nm2_3| {4} \nm2_4| {5} \nm2_5| {6} \nm2_6| {7} \nm2_7| {8} \nm2_8| {9} \nm2_9| {10}'.format(*match_list))

[output]

matches: 
m1_1| <re.Match object; span=(0, 73), match='"type": "Car",\n  "brand": "Ford",\n  "model": "F> 
m1_2| <re.Match object; span=(0, 14), match='"type": "Car",'> 
m2_1| <re.Match object; span=(0, 69), match='"Type": "Car",\n"brand": "Ford", \n"model": "Fies> 
m2_2| None 
m2_3| <re.Match object; span=(0, 69), match='"Type": "Car",\n"brand": "Ford", \n"model": "Fies> 
m2_4| None 
m2_5| <re.Match object; span=(0, 23), match='"Type": "Car",\n"brand":'> 
m2_6| None 
m2_7| <re.Match object; span=(0, 21), match='"Type": "Car",\n"brand'> 
m2_8| None 
m2_9| <re.Match object; span=(0, 21), match='"Type": "Car",\n"brand'>

[UPDATED]
I've taken note of interesting posts on regex flag

semmyk-research
0
import re
from string import punctuation


def clean_with_punctuation(text):
    # Map every punctuation character (plus a few multi-character markers) to a token.
    punctuation_token = {p: '<PUNC_' + p + '>' for p in punctuation}
    punctuation_token['<br/>'] = "<TOKEN_BL>"
    punctuation_token['\n'] = "<TOKEN_NL>"
    punctuation_token['<EOF>'] = '<TOKEN_EOF>'
    punctuation_token['<SOF>'] = '<TOKEN_SOF>'

    # Always put new sequence tokens at the front to avoid overlapping results.
    # The pattern is split into two adjacent raw literals so that no stray
    # whitespace ends up inside the character class. \n matches a newline here.
    regex = (r"(<br/>)|(<EOF>)|(<SOF>)|[\n\!\@\#\$\%\^\&\*\(\)\[\]"
             r"\{\}\;\:\,\.\/\?\|\`\_\\\+\=\~\-\<\>]")

    # text = '<EOF>!@#$%^&*()[]{};:,./<>?\|`~-= _+\<br/>\n <SOF>\ '
    text_ = ""
    index = 0

    for match in re.finditer(regex, text):
        # Copy the text before the match, then the token padded with spaces.
        text_ += (text[index:match.start()] + " "
                  + punctuation_token[match.group()] + " ")
        index = match.end()

    # Append whatever follows the last match.
    return text_ + text[index:]
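
The same token-substitution idea can also be sketched with re.sub and a replacement function. This is only an illustration under assumptions: the names TOKENS, PATTERN and tokenize are hypothetical, and the token table is a reduced version of the one above. Building the pattern with re.escape avoids hand-escaping each character, and the newline is matched like any other key.

```python
import re
from string import punctuation

# Hypothetical token table: punctuation plus two multi-character markers.
TOKENS = {p: f"<PUNC_{p}>" for p in punctuation}
TOKENS["\n"] = "<TOKEN_NL>"
TOKENS["<br/>"] = "<TOKEN_BL>"

# Longest keys first so that '<br/>' wins over a lone '<'.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(TOKENS, key=len, reverse=True))
)


def tokenize(text):
    # Replace each matched key with its token, padded with spaces.
    return PATTERN.sub(lambda m: f" {TOKENS[m.group()]} ", text)


result = tokenize("hi!\nbye<br/>")
print(repr(result))
```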