I'm using re.escape in Python. I'm confused about why re.escape('\n') returns '\\\n'. I thought it should be '\\n', because it needs to match the newline character. Can anyone explain?

shu
  • Try `print re.escape('\n')`, as opposed to just `re.escape('\n')`, and see the question and answer at http://stackoverflow.com/questions/301068/python-backslash-quoting-in-string-literals – Charles Duffy Feb 17 '15 at 01:15
  • Pardon? You might check how StackOverflow formatted your comment; it's not readable as given. – Charles Duffy Feb 17 '15 at 02:07
  • The literal backslash and the newline is exactly what you would expect: The backslash escapes the newline, so compiling that two-character literal string (for which `repr()` renders into a four-character representation) into a regular expression results in a regex that escapes only a single character -- the newline. – Charles Duffy Feb 17 '15 at 02:12
  • If you don't expect `len('\\\n')` to be 2, then it's time to check some assumptions. :) – Charles Duffy Feb 17 '15 at 02:12
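
The point of these comments can be checked directly. This is a minimal sketch, assuming Python 3 (where `print` is a function, unlike the `print re.escape('\n')` form suggested above); the variable name `escaped` is only for illustration:

```python
import re

escaped = re.escape('\n')

# The result holds two characters: a backslash and a newline.
print(len(escaped))

# repr() spells the backslash as \\ and the newline as \n,
# which is why the prompt shows four characters between the quotes.
print(repr(escaped))

# The same string written out as a literal.
print(escaped == '\\\n')
```

Printing the string itself, rather than its `repr()`, shows a single backslash followed by an actual line break.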

1 Answer

The documentation states what the re.escape() function does:

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

>>> import re
>>> re.escape('\n')
'\\\n'
 ^^^^
 | |
 | |__________________ The newline character (shown by repr() as \n)
 | 
 |____________________ The added backslash (shown by repr() as \\)

When this function is used, it places a backslash in front of every non-alphanumeric character, which includes the newline as well as the usual regex metacharacters.
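
To see the effect on a string that also contains a real metacharacter, here is a short sketch (Python 3 assumed; the sample string `text` is made up for illustration):

```python
import re

text = 'a.b\n'            # hypothetical sample: a dot and a newline
pattern = re.escape(text)

# The dot and the newline each get one backslash in front.
print(repr(pattern))

# The escaped pattern matches the original string literally.
print(re.fullmatch(pattern, text) is not None)

# Unescaped, the dot is a wildcard and also matches other characters;
# the escaped pattern does not.
print(re.fullmatch(text, 'aXb\n') is not None)
print(re.fullmatch(pattern, 'aXb\n') is not None)
```

The backslash in front of the newline is harmless: a backslash followed by a newline character in a pattern still matches just that newline.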

hwnd
  • Saying that it's "returned backslashes", plural, isn't accurate -- it's only one backslash, but represented with two backslash characters when formatted by repr() at the prompt. – Charles Duffy Feb 17 '15 at 01:17
  • I'm still a little confused. Would it make sense if the output were `\\n`? That is, Python interprets it as `\n`, then re interprets it as the `\n` metacharacter, which is the parameter in re.escape(). – shu Feb 17 '15 at 02:03
  • @shu, the latter `\n` is the repr() representation of a literal carriage return -- so, you have a literal slash, followed by a literal carriage return. This makes sense, because the slash escapes the carriage return, so the resulting regex matches a literal carriage return and nothing else, just as it should. – Charles Duffy Feb 17 '15 at 02:08
  • @CharlesDuffy Is the literal CR constructed by a backslash and a character n? And I've tried `re.compile('\\\n').match('\n')`, and it works. But `re.compile('\\n').match('\n')` also works. – shu Feb 17 '15 at 02:55
  • @CharlesDuffy Sorry for the formatting. I can understand the result of `len('\\\n')`, but is it like `backslash + CR -> CR`, and `backslash + n -> CR` also? – shu Feb 17 '15 at 02:57
  • @shu, what do you mean, "constructed"? The literal newline is a single character, by definition. Inside of single quotes in Python, that literal newline can be represented by two characters, the first of those being a backslash, and the second being an `n`. Granted, inside a regex, a _literal_ backslash followed by an `n` will _also_ generate a pattern that matches a literal newline, but those characters don't meet the definition of "literal", so what exactly you mean by "constructed" matters. – Charles Duffy Feb 17 '15 at 03:10
  • ...but yes, backslash + CR will match a CR, and backslash + n will also _match_ a CR, even though only the latter formulation involves any literals. This is to say that `re.escape()` could also be written in such a way as to emit a backslash followed by an `n`, and would be correct if it did so -- though its current behavior is _also_ correct; which one of those correct formulations it uses at any given version is simply an implementation detail. – Charles Duffy Feb 17 '15 at 03:12
  • @CharlesDuffy I mean that re or Python treats a backslash and an `n` together as a CR. – shu Feb 17 '15 at 03:14
  • @shu, and the point I'm making is that both of them do (the former when in single-quoted strings -- that's not true for some other quoting types; `r'\n'`, for instance, is parsed as two characters by Python, whereas `'\n'` is parsed as one). Both `'\n'` and `r'\n'` will result in a regex which matches a single newline, however; this dichotomy is why there are _two_ correct formulations. – Charles Duffy Feb 17 '15 at 03:17
  • @CharlesDuffy I think I've got what you mean by the literal newline, and the re implementation detail. Cool! Thank you. – shu Feb 17 '15 at 03:20
  • Ergh. "Even though only the latter formulation involves any literals" should have been "even though only the former formulation involves a literal carriage return". My apologies -- but very glad to hear the point got through. :) – Charles Duffy Feb 17 '15 at 03:43
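
The two formulations discussed in this thread can be compared side by side. This is a minimal sketch, assuming Python 3; the names `literal_form` and `escape_form` are only for illustration:

```python
import re

literal_form = '\\\n'   # backslash + newline character (what re.escape returns)
escape_form = r'\n'     # backslash + the letter n (same string as '\\n')

print(len(literal_form), len(escape_form))  # both are two characters long
print(literal_form == escape_form)          # but they are different strings

# Each one, and the bare newline itself, compiles to a regex
# that matches exactly one newline.
for pat in (literal_form, escape_form, '\n'):
    print(repr(pat), re.fullmatch(pat, '\n') is not None)
```

In the first pattern the regex engine sees a backslash escaping a literal newline; in the second it sees the `\n` escape sequence. Both end up matching the same single character.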