
I am using the re library (import re), and I have a string that, in human-readable form (as you would see it printed in a book), looks like this:

\t

It is not a tab; it's a slash t (a backslash followed by the letter t). I know Python will treat it as a tab if I write a="\t", so I write a="\\t" to represent a real-world slash t (not a tab, but a literal slash t). The only pattern that matches the string a="\\t" is pattern='\\\\t'. Why does the regex need so many backslashes to match a literal slash t?

I want to match that literal slash t with a regex. Here is what I have tried:

  1. re.findall(pattern='\t',string='\\t') : Doesn't match (this makes sense to me)
  2. re.findall(pattern='\\t',string='\\t') : Doesn't match (this does not make sense to me)
  3. re.findall(pattern='\\\t',string='\\t') : Doesn't match (I think this would match a slash followed by a tab)
  4. re.findall(pattern='\\\\t',string='\\t') : Finally, it matches
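
The four attempts above can be reproduced in one short script. This is only a minimal sketch: the names `target`, `patterns` and `results` are made up for illustration, and it prints the repr and length of each pattern so you can see how many characters the regex engine actually receives.

```python
import re

# The target string: two characters, a backslash followed by the letter t.
target = '\\t'

# The four patterns tried above, written exactly as in the list.
patterns = ['\t', '\\t', '\\\t', '\\\\t']

# Run each pattern against the same target.
results = [re.findall(p, target) for p in patterns]

for p, r in zip(patterns, results):
    # len(p) counts the characters left after Python has processed the literal.
    print(f"pattern={p!r:10} length={len(p)} matches={r}")
```

Only the last pattern returns a match, and it is also the only one that still contains two backslashes after Python has processed the string literal.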

So I want to know why so many backslashes.

I know that:

  1. In general, in a Python string literal a single \ acts as an escape character in two cases: when it is followed by a special character, or when the combination of the backslash and the character right after it is a special sequence. If neither is the case, it is just a literal backslash. For example:
    1. Backslash with special character:

      This code a="\'" is, in human-readable form, just a single quote character, because the backslash suppresses the quote's role as a Python string delimiter and makes it a literal quote. In other words, if you send the value of variable a to a physical printer, your piece of paper will have a single quote on it and nothing else.

    2. Backslash with no special character:

      This code a="\o" is, in human-readable form, a backslash followed by the letter o, because the letter o is not a special character, so the backslash preceding it acts as a literal backslash. If we send the variable a to a printer, our paper will have a \ followed by the letter o.

    3. Backslash with no special character, but the combination is a special sequence:

      The letter t is not a special character in Python, so by the logic of the previous point, this code a="\t" could be thought to be a slash t in human-readable form. However, that's not the case. This is the third case: the backslash combined with the character right after it forms a special sequence. In Python, the combination of slash and t is a special sequence that represents a tab. So if we send the variable to a printer we get a blank page, although the printer did "print" a tab (which uses no ink, which is why the page is blank). The same logic applies to the special sequence \n, which is a newline.

  2. If you type this string in Notepad, `Some text \t and then more`, read the file in Python and store it in a variable, then printing the variable can show either "Some text \\t and then more" or "Some text \t and then more", depending on how you print it and on your IDE. Regardless of how your IDE displays it, Python knows the slash t is NOT a tab, because it read your file literally, and a literal \t in a file is not a tab; only an actual tab character is a tab in Notepad, since a tab has a different binary representation from a slash and a t together. In short, Python doesn't think the \t from Notepad is a tab; it's a literal slash t.
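
The points in the list above can be checked by asking Python for the length and repr() of each string. The following is a rough sketch, not a definitive test: the variable names are invented for the example, and a temporary file merely stands in for the Notepad file.

```python
import os
import tempfile

quote = "\'"      # escaped special character: one single-quote character
tab = "\t"        # special sequence: one tab character
slash_t = "\\t"   # escaped backslash, then t: two characters

for name, value in [("quote", quote), ("tab", tab), ("slash_t", slash_t)]:
    print(name, len(value), repr(value))

# Hypothetical stand-in for the Notepad file. The raw string (r"...") writes
# the backslash and the t exactly as typed, with no tab involved.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(r"Some text \t and then more")
    path = f.name

with open(path) as f:
    from_file = f.read()
os.remove(path)

print(from_file)         # str() view: shows a single backslash
print(repr(from_file))   # repr() view: shows the backslash doubled

# The text read from the file equals the literal written with "\\t".
same = from_file == "Some text \\t and then more"
has_tab = "\t" in from_file
print(same, has_tab)
```

The two ways of displaying the string differ only in presentation: print() shows the characters themselves, while repr() shows a literal you could paste back into source code.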

Having said this, if I write the code a="some text \\t and then some more", I get the same string as when reading the Notepad file. And here comes the question:

Why does it take four backslashes? What is each backslash doing, or are all four together one escape sequence? And what would patterns 2 and 3 even match? What are some example strings?

This question (Python escaping backslash) makes me think that pattern 2, re.findall(pattern='\\t',string='\\t'), should have matched; it says that two slashes \\ are a single slash.

  • for reference: [the-backslash-plague](https://docs.python.org/3/howto/regex.html#the-backslash-plague) – topsail Aug 03 '23 at 01:27
  • This answer https://stackoverflow.com/a/4025505/2453382 clearly explains how \\\\ becomes a single backslash –  Aug 03 '23 at 01:29
  • Not entirely, if I understand that answer. From it, I would infer that '\\\\t' becomes a literal search for two slashes and a t, not one slash and a t – Eugenio.Gastelum96 Aug 03 '23 at 01:34
  • As the answer explains, there are two **separate and independent** steps: first Python takes the source code in order to figure out what's in the string, then the regex engine takes the string to figure out what the regex should match. When the regex engine is interpreting the string, it applies its own escaping rules. `'\\\\t'` becomes a *regex pattern* that contains two backslashes, but that pattern *matches* one backslash. – Karl Knechtel Aug 03 '23 at 01:36
  • Long story short, use raw string literal syntax to minimize the number of backslashes (or backlashes `:P`). – InSync Aug 03 '23 at 01:36
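
The two separate steps described in the comments can be made visible with a short experiment. This is a sketch under stated assumptions: all variable names are illustrative, and it relies only on `re.findall` and `re.escape` from the standard library.

```python
import re

target = '\\t'   # the string to search: a backslash followed by t

# Step 1: Python processes the string literal.
# Four backslashes in source code become two backslashes in the string.
pattern = '\\\\t'
print(len(pattern), pattern)   # three characters: backslash, backslash, t

# A raw string literal skips step 1, so it is typed as the regex engine sees it.
raw_pattern = r'\\t'
same_pattern = pattern == raw_pattern

# Step 2: the regex engine processes the pattern string.
# It reads the two backslashes as "one literal backslash", followed by t.
matches = re.findall(raw_pattern, target)

# Patterns 2 and 3 from the question both end up meaning "a tab character"
# to the regex engine, so they match a string that contains a real tab.
tab_matches_2 = re.findall('\\t', 'a\tb')
tab_matches_3 = re.findall('\\\t', 'a\tb')

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape(target)

print(same_pattern, matches, tab_matches_2, tab_matches_3, escaped)
```

Writing the pattern as a raw string, or building it with `re.escape`, leaves only the regex engine's own escaping to think about.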
