
Say, for example, I want to know whether the pattern "\section" is in the text "abcd\sectiondefghi". Of course, I can do this:

import re

motif = r"\\section"
txt = r"abcd\sectiondefghi"
pattern = re.compile(motif)
print(pattern.findall(txt))

That gives me what I want. However, each time I want to find a new pattern in a new text, I have to change the code, which is painful. Therefore, I want to write something more flexible, like this (test.py):

import re
import sys

motif = sys.argv[1]
txt = sys.argv[2]
pattern = re.compile(motif)
print(pattern.findall(txt))

Then, I want to run it in a terminal like this:

python test.py \\section abcd\sectiondefghi

However, that does not work (and I would hate to have to type \\\\section).

So, is there any way of converting my user input (either from the terminal or from a file) to a Python raw string? Or is there a better way of compiling a regular expression pattern from user input?

Thank you very much.

  • Note that this question isn't really about converting a string to a raw string. The problem here is simply that the shell running in the terminal requires you to escape the command line arguments. Most shells will turn that `\s` in `cd\sec` into a regular old `s` character. No matter what you do in the python code, you won't be able to tell that there was ever a backslash in front of that `s`. For a question that really is about turning special characters in a string into escape sequences, see [here](//stackoverflow.com/q/2428117). – Aran-Fey Oct 10 '18 at 18:25

3 Answers


Use re.escape() to make sure input text is treated as literal text in a regular expression:

pattern = re.compile(re.escape(motif))

Demo:

>>> import re
>>> motif = r"\section"
>>> txt = r"abcd\sectiondefghi"
>>> pattern = re.compile(re.escape(motif))
>>> print(pattern.findall(txt))
['\\section']

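re.escape() escapes all non-alphanumeric characters by adding a backslash in front of each one (Python 3.7 and later escape only characters that can have special meaning in a regular expression, so the `!` below would be left alone):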

>>> re.escape(motif)
'\\\\section'
>>> re.escape('\n [hello world!]')
'\\\n\\ \\[hello\\ world\\!\\]'
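
As a quick check of the difference, here is a minimal sketch (written for Python 3, not part of the original demo) that runs the same input with and without re.escape(). The names unescaped_hits and escaped_hits are only illustrative.

```python
import re

motif = r"\section"            # what the user typed: a backslash plus "section"
txt = r"abcd\sectiondefghi"

# Unescaped: "\s" is the regex class for whitespace, so nothing matches.
unescaped_hits = re.findall(motif, txt)

# Escaped: the backslash is matched as a literal character.
escaped_hits = re.findall(re.escape(motif), txt)

print(unescaped_hits)
print(escaped_hits)
```

The unescaped pattern is still a valid regex, it just means something other than the literal text, which is why the failure is silent rather than an error.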
  • On the other hand, if you're searching for literal strings, re is the wrong tool. – Fredrik Jul 24 '13 at 10:47
  • @Fredrik: I was assuming this was going to be part of a larger pattern and the OP just simplified. – Martijn Pieters Jul 24 '13 at 10:49
  • @Fredrik: I am interested in more degenerate search patterns and want to use re. The code I provided is just a simplified example. – dbrg77 Jul 24 '13 at 12:43
  • Then your question doesn't really make sense -- if you need to input a regexp to your program, you need to input the actual regexp. The "raw string" syntax is only relevant if you're writing Python code (it controls how backslashes in the string literal are interpreted by Python when it constructs the string, but has no effect on the regexp machinery). Is your problem perhaps that the shell is adding an extra layer of escaping? – Fredrik Jul 24 '13 at 12:51
  • (i.e., the shell also parses backslashes, so if you type \\ at the command line, Python only sees one of them -- a simple workaround is to put the pattern inside single quotes, which disables escaping) – Fredrik Jul 24 '13 at 12:52
  • @Fredrik: No, the problem is that people think raw string literals are a requirement for building regular expressions. – Martijn Pieters Jul 24 '13 at 12:53
  • @Fredrik: The *real* problem here is interpolating user input into a regular expression, where meta characters should not be interpreted as such but as literal text instead. – Martijn Pieters Jul 24 '13 at 12:54
  • The original question contained a shell example, so it's far from clear that the shell isn't part of the confusion here. But yeah, whatever the problem is, raw strings have nothing to do with it. – Fredrik Jul 24 '13 at 13:02
  • @Fredrik: As Martijn said, I just started to learn Python, and according to the regular expression HOWTO (http://docs.python.org/2/howto/regex.html), it is better to use raw strings to construct regular expressions. That's what my question is really about. – dbrg77 Jul 24 '13 at 13:04
  • Raw strings are a way to turn off the escaping rules for string literals in Python code (just like single quotes are a way to turn off the shell's escaping rules in the shell), and have nothing to do with the regexp machinery, beyond being convenient when you write Python code. But if you get the regexp from outside Python, it's not Python code... – Fredrik Jul 24 '13 at 13:13
  • Found via ddg search for `python turn string into literal for regex`. Thanks! – bgStack15 Oct 19 '16 at 18:29
  • I don't like this answer because it doesn't address the actual problem - an extra level of escaping is required because of the shell. Blindly escaping everything isn't a solution. It's a bug. You can't retroactively fix the mistakes the user made when they entered the command line arguments. – Aran-Fey Oct 10 '18 at 16:54
  • @Aran-Fey: absolutely not, this is *not* a shell escaping issue. The end user is not told that they are entering a regex, so for *plain text input* you absolutely *must* use `re.escape()` to ensure that metacharacters in the plain text are not interpreted as such. – Martijn Pieters Oct 10 '18 at 17:45
  • I guess you have a point, but it's not actually clear whether the user knows they're entering a regex. If the user thinks they're entering plain text, why is the OP even using regex to perform the search? That makes no sense. It's more reasonable to assume that the user is *supposed* to enter a regex. – Aran-Fey Oct 10 '18 at 17:49
  • @Aran-Fey: the example can easily have been simplified down to basics. – Martijn Pieters Oct 10 '18 at 17:50
  • I still think you should've addressed it. The shell command seen in the question makes no sense, since any regular shell will turn that `\s` into a normal `s` character. No matter what you do in the python code, the input will be a plain old `s` indistinguishable from any other `s`. – Aran-Fey Oct 10 '18 at 17:56
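
To illustrate the point made in the comments above that raw strings only affect how Python reads a string literal, here is a small Python 3 sketch. The user_input value is a hypothetical stand-in for a string that arrived via sys.argv or a file.

```python
import re

raw = r"\\section"        # raw literal: both backslashes are kept as typed
plain = "\\\\section"     # ordinary literal: each "\\" pair becomes one backslash

same_string = (raw == plain)   # the two literals produce equal strings
length = len(raw)              # two backslashes plus the seven letters of "section"

# A string that comes from outside the source code is already "just characters";
# there is no literal syntax to interpret, so nothing needs converting.
user_input = "\\" + "\\" + "section"
hits = re.findall(user_input, r"abcd\sectiondefghi")

print(same_string, length, hits)
```

So there is no "raw" flavour of a string object to convert to; the only question is which characters actually reach the program.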

One way to do this is using an argument parser, like optparse or argparse.

Your code would look something like this:

import re
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-s", "--string", dest="string",
                  help="The string to parse")
parser.add_option("-r", "--regexp", dest="regexp",
                  help="The regular expression")
parser.add_option("-a", "--action", dest="action", default='findall',
                  help="The action to perform with the regexp")

(options, args) = parser.parse_args()

print(getattr(re, options.action)(re.escape(options.regexp), options.string))

An example of me using it:

> code.py -s "this is a string" -r "this is a (\S+)"
['string']

Using your example:

> code.py -s "abcd\sectiondefghi" -r "\section"
['\\section'] 
# remember, this is a python list containing a string, the extra \ is okay.
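
Since argparse is mentioned above but not shown, here is a rough Python 3 sketch of the same idea with argparse. The --literal flag and the build_parser/search names are my own assumptions, not part of the script above; the flag is there because escaping unconditionally with re.escape() turns metacharacters such as `(\S+)` into literal text, so a pattern meant as a regex would stop matching.

```python
import argparse
import re


def build_parser():
    parser = argparse.ArgumentParser(description="Search a string with a pattern")
    parser.add_argument("-s", "--string", required=True, help="The string to search")
    parser.add_argument("-r", "--regexp", required=True, help="The pattern")
    # Hypothetical flag: treat the pattern as literal text instead of a regex.
    parser.add_argument("-l", "--literal", action="store_true",
                        help="Escape the pattern with re.escape()")
    return parser


def search(argv):
    options = build_parser().parse_args(argv)
    pattern = re.escape(options.regexp) if options.literal else options.regexp
    return re.findall(pattern, options.string)


# Passing argument lists directly bypasses the shell, so no shell quoting is involved.
regex_hits = search(["-s", "this is a string", "-r", r"this is a (\S+)"])
literal_hits = search(["-s", r"abcd\sectiondefghi", "-r", r"\section", "--literal"])
print(regex_hits)
print(literal_hits)
```

In a real script you would call search(sys.argv[1:]) instead of passing the lists by hand.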

So just to be clear, is the thing you search for ("\section" in your example) supposed to be a regular expression or a literal string? If the latter, the re module isn't really the right tool for the task; given a search string needle and a target string haystack, you can do:

# is it in there
needle in haystack

# how many copies are there
n = haystack.count(needle)
# where is it
ix = haystack.find(needle)

all of which are more efficient than the regexp-based version.

re.escape is still useful if you need to insert a literal fragment into a larger regexp at runtime, but if you end up doing re.compile(re.escape(needle)), there are, in most cases, better tools for the task.
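
For the "literal fragment inside a larger regexp" case, something along these lines should work (a Python 3 sketch; the brace-matching part of the pattern and the sample text are made up for illustration):

```python
import re

needle = r"\section"   # literal text supplied at runtime (example value)

# Escape only the literal fragment, then append the real regex part:
# a brace-delimited argument, as in \section{Intro}.
pattern = re.compile(re.escape(needle) + r"\{([^}]*)\}")

titles = pattern.findall(r"\section{Intro} text \section{Methods}")
print(titles)
```

Only the user-supplied fragment goes through re.escape(); the rest of the pattern keeps its metacharacters.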

EDIT: I'm beginning to suspect that the real issue here is the shell's escaping rules, which have nothing to do with Python or raw strings. That is, if you type:

python test.py \\section abcd\sectiondefghi

into a Unix-style shell, the "\\section" part is converted to "\section" by the shell (and "abcd\sectiondefghi" to "abcdsectiondefghi") before Python sees it. The simplest way to fix that is to tell the shell to skip unescaping, which you can do by putting each argument inside single quotes:

python test.py '\\section' 'abcd\sectiondefghi'

Compare and contrast:

$ python -c "import sys; print(','.join(sys.argv))" test.py \\section abcd\sectiondefghi
-c,test.py,\section,abcdsectiondefghi

$ python -c "import sys; print(','.join(sys.argv))" test.py '\\section' 'abcd\sectiondefghi'
-c,test.py,\\section,abcd\sectiondefghi

(explicitly using print on a joined string here to avoid repr adding even more confusion...)
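
If you want to reproduce this without a terminal, the standard library's shlex module splits a command line using POSIX-like quoting rules. This is only an approximation of what a real shell does (shells differ in details), and the sketch below is Python 3:

```python
import re
import shlex

# Unquoted: each backslash escapes the next character and is then dropped.
unquoted = shlex.split(r"python test.py \\section abcd\sectiondefghi")

# Single-quoted: backslashes are passed through untouched.
quoted = shlex.split(r"python test.py '\\section' 'abcd\sectiondefghi'")

print(unquoted[2:])
print(quoted[2:])

# With the quoted form, test.py receives a regex for a literal backslash
# followed by "section", and a text that still contains the backslash.
motif, txt = quoted[2:]
hits = re.findall(motif, txt)
print(hits)
```

Note that the lists are printed with repr here, so every backslash shows up doubled in the output.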
