I have read a lot of questions looking for the answer; sorry if I missed it.

Let's say I have a text containing only a newline character:
text = '\n'

Because regular expressions, like Python string literals, use the backslash character ('\') to escape characters with special meaning, we would match the newline character by using raw string notation, just as this answer suggested. (Please correct me if I am wrong.)

So we would write regex = re.compile(r'\n'), and the regex parser would read a backslash followed by the character 'n' and interpret the pair as a newline character.

My question is: why does regex = re.compile('\n') also work?

I tried regex.match(text) and the result is <_sre.SRE_Match object; span=(0, 1), match='\n'>, which is the same as with raw string notation.


Is it because of what the documentation says here?

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser: \a \b \f \n \r \t \v \x \\

Could someone explain in detail?

Deduplicator
Le0
  • I think your question is an answer, 2-in-1. The regex engine is actually capable of searching for a literal newline. – Wiktor Stribiżew Jun 18 '16 at 19:23
  • The string produced by the literal `'\n'` _does not_ contain a backslash. It contains a new-line character. A new-line character doesn't have any special regex meaning. – khelwood Jun 18 '16 at 19:27
  • @WiktorStribiżew I am just confused about how it works, so I was hoping for a more detailed explanation. – Le0 Jun 18 '16 at 19:54
  • @khelwood does it mean \a \f \r \t \v \x \\ would work too as they have no special meaning to regex? – Le0 Jun 18 '16 at 19:56
  • @Le0 if you write (for instance) `'\a\f\r\t\v\x00'` as a string literal, the string it produces will not contain a backslash, so the regex parser will not regard the string as containing any special regex characters. If you put `\\` in a string literal, the string produced contains a single backslash, which does have a special meaning to the regular expression parser, depending on the character that follows. – khelwood Jun 18 '16 at 20:07

1 Answer


The r'\n' form suppresses Python's interpretation of escape sequences in the string literal, so the string contains two characters: '\' and 'n'. The regular expression engine interprets those two characters as an escape for the newline character. In the second case, Python itself converts '\n' to a single newline character (LF) before the string ever reaches re.compile. This is one character on every platform; translation to CR LF on Windows happens only in text-mode file I/O, not in string literals. The regular expression compiler then takes it as an explicitly given character (no backslash, no special interpretation) and matches it literally.
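A minimal sketch to check this yourself, assuming Python 3 and the standard re module. The variable names are only for illustration. It shows that the two pattern strings differ in length, yet both compile to something that matches the one-character text:

```python
import re

text = '\n'

raw_pattern = r'\n'    # two characters: a backslash and the letter n
plain_pattern = '\n'   # one character: LF, already converted by Python

# The pattern strings themselves are different...
print(len(raw_pattern), len(plain_pattern))   # 2 1
print(ord(plain_pattern))                     # 10 (LF)

# ...but both match the newline in the text.
raw_match = re.compile(raw_pattern).match(text)
plain_match = re.compile(plain_pattern).match(text)
print(raw_match.span(), plain_match.span())   # (0, 1) (0, 1)
```

In the first case the regex parser does the conversion; in the second, the Python parser has already done it and the regex parser only sees an ordinary character.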

pepr
  • So does it mean the regex compiler could also take \a \b \f \r \t \v \x \\ as well? – Le0 Jun 18 '16 at 19:52
  • @Le0: Yes, this is the same case. Basically, you want to express a special character (or a sequence of special characters). You can write it directly into the source code. However, some of the characters would be invisible in the editor, or would break the syntax of the programming language (like the newline), or could not be represented in the encoding used by the source-code file. This is the reason for using escape sequences. Regular expressions also need backslashes for some special sequences used only in regular expressions. Because of that, `r'raw string literals'` are preferred. – pepr Jun 19 '16 at 12:51
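A short sketch to test the other escapes from the comments, assuming Python 3 and the standard re module (the variable names are hypothetical). Most of them behave like \n, but two are worth checking separately: outside a character class, the regex escape \b means a word boundary rather than a backspace, and '\\' is a lone backslash, which is not a complete pattern on its own:

```python
import re

# Escapes that denote the same character in a string literal and in a regex:
# the raw and the non-raw pattern both match the same one-character text.
same_meaning = [(r'\a', '\a'), (r'\f', '\f'), (r'\r', '\r'),
                (r'\t', '\t'), (r'\v', '\v'), (r'\x41', '\x41')]
results = [bool(re.fullmatch(raw, plain)) and bool(re.fullmatch(plain, plain))
           for raw, plain in same_meaning]
print(all(results))   # True

# \b differs: r'\b' is a zero-width word boundary, '\b' is a backspace character.
word_boundary = re.search(r'\b', 'word')   # matches at the start of 'word'
backspace = re.search('\b', 'word')        # no backspace in 'word'
print(word_boundary.span(), backspace)     # (0, 0) None

# '\\' is a single backslash, which the regex parser rejects as incomplete.
try:
    re.compile('\\')
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False
print(lone_backslash_ok)                   # False

# r'\\' (two backslashes) is the regex that matches one literal backslash.
print(bool(re.fullmatch(r'\\', '\\')))     # True
```

This is one more reason raw strings are the safer default for patterns: what you type is exactly what the regex parser sees.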