Add a backslash to all non-word characters in a string

Question

I want to use each string in a list (url_key) as a regex to check whether the elements of another list contain that pattern. To do that, I need to add a backslash before every non-word character in each element of url_key, using Python.

Example of my code:

import re
sentences = ["Disallow DCCP sockets due to such NFC-3456",
            "Check at http://www.n.io/search?query=title++sub/file.html",
            "Specifies the hash algorithm on them"]

url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']    # there are thousands of elements
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']    # there are hundreds of elements

pattern = url_key + add_key
mykey = re.compile('(?:% s)' % '|'.join(pattern))

for item in sentences:
    if mykey.search(item):
        print (item, ' --> Keyword is found')
    else:
        print (item, ' --> Keyword is not found')

But this code gives me an error:

error                             Traceback (most recent call last)
<ipython-input-80-5348ee9c65ec> in <module>()
      8 
      9 pattern = url_key + add_key
---> 10 mykey = re.compile('(?:% s)' % '|'.join(pattern))
     11 
     12 for item in sentences:

~/anaconda3/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/anaconda3/lib/python3.6/re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.6/sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

~/anaconda3/lib/python3.6/sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
    763                 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    764                                not (del_flags & SRE_FLAG_VERBOSE))
--> 765                 p = _parse_sub(source, state, sub_verbose, nested + 1)
    766             if not source.match(")"):
    767                 raise source.error("missing ), unterminated subpattern",

~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
    617             if item[0][0] in _REPEATCODES:
    618                 raise source.error("multiple repeat",
--> 619                                    source.tell() - here + len(this))
    620             if sourcematch("?"):
    621                 subpattern[-1] = (MIN_REPEAT, (min, max, item))

error: multiple repeat at position 31

Expected result:

Disallow DCCP sockets due to such NFC-3456 --> Keyword is found 
Check at http://www.n.io/search?query=title++sub/file.html --> Keyword is found
Specifies the hash algorithm on them --> Keyword is not found

Any help would be appreciated. Thanks.

Can you explain why you need to do this? This sounds like [an XY problem](https://meta.stackexchange.com/q/66377/322040) you're trying to hack by hand, when an existing string escaping API would do the trick (without necessarily behaving exactly the way you describe). Also, saying "didn't work" isn't very helpful; a [MCVE] should show the observed output (including traceback if an exception was raised). — ShadowRanger, Dec 05 '19 at 00:23
I agree with @ShadowRanger entirely. Be careful, take your time, don’t just jump on something because “it works”. — AMC, Dec 05 '19 at 03:14

Selcuk · Answer 1 · 2019-12-05T00:31:43.763

1

You should either use raw strings:

result = re.sub(r'(\W)', r'\\\1', mystring)

or escape the backslashes too:

result = re.sub('(\\W)', '\\\\\\1', mystring)
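As a quick check that the two spellings are the same thing, here is a minimal sketch. The names `raw_repl`, `escaped_repl` and `add_backslashes` are made up for illustration and are not part of the question's code:

```python
import re

# Two spellings of the same four-character replacement template: \\\1
raw_repl = r'\\\1'          # raw string: backslashes are kept as typed
escaped_repl = '\\\\\\1'    # normal string: each pair collapses to one backslash

def add_backslashes(text):
    # Hypothetical helper: prefix every non-word character with a backslash.
    # In the template, \\ yields a literal backslash and \1 is the matched character.
    return re.sub(r'(\W)', raw_repl, text)

print(raw_repl == escaped_repl)   # True
print(add_backslashes('a.b+c'))   # a\.b\+c
```

Either literal hands the same template to re.sub, so the choice is only about readability.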

edited Dec 05 '19 at 00:31

answered Dec 05 '19 at 00:27

score 0 · Accepted Answer · answered Dec 05 '19 at 00:28

Your main problem is that string escapes take effect before the regex substitution escapes. Switching to raw strings (to inhibit string escapes) and escaping the backslash (because a backslash is itself the escape character in a substitution template) will fix this:

>>> print(re.sub(r'(\W)', r'\\\1', '?:n.io/search?query=title++sub'))
\?\:n\.io\/search\?query\=title\+\+sub

Note that you may not need such extensive escaping. If you just want to escape regex special characters, re.escape will do this for you:

>>> print(re.escape('?:n.io/search?query=title++sub'))
\?:n\.io/search\?query=title\+\+sub

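without adding unnecessary escapes (ones that aren't needed to despecialize regex characters). The output above is from Python 3.7 or later; on earlier versions, including 3.6, re.escape escapes every character except ASCII letters, digits and the underscore, which is still safe to use in a pattern.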
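Applied to the code in the question, this could look roughly like the sketch below. It assumes that the entries in url_key are meant as literal text and the entries in add_key are already regular expressions, so only url_key is escaped. The helper `has_keyword` is a name introduced here for illustration:

```python
import re

sentences = [
    "Disallow DCCP sockets due to such NFC-3456",
    "Check at http://www.n.io/search?query=title++sub/file.html",
    "Specifies the hash algorithm on them",
]

url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']  # literal text
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']                        # already regexes

# Escape only the literal URL fragments; leave the regex keys untouched.
pattern = [re.escape(key) for key in url_key] + add_key
mykey = re.compile('(?:%s)' % '|'.join(pattern))

def has_keyword(sentence):
    # Hypothetical helper: True if any key matches somewhere in the sentence.
    return mykey.search(sentence) is not None

for item in sentences:
    status = 'found' if has_keyword(item) else 'not found'
    print(item, '--> Keyword is', status)
```

The list comprehension also covers the case of keeping the escaped strings in a list rather than printing them. Escaping add_key as well would turn patterns such as [0-9]{4} into literal text, which is why the two lists are handled differently here.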

answered Dec 05 '19 at 00:28

ShadowRanger

what if I want to save that result into a list, instead of just printing? E.g. `lst = ['\?:n\.io\/search\?query=title\+\+sub']` – YusufUMS Dec 05 '19 at 00:49
@YusufUMS: Just wrap it in square brackets instead of `print()`? `[re.escape('?:n.io/search?query=title++sub')]` – ShadowRanger Dec 05 '19 at 01:15
@YusufUMS Do you know how to create a list? – AMC Dec 05 '19 at 03:16
