
I want to use each string in a list (url_key) as a regex to check whether the elements of another list contain it. To do that, I need to add a backslash before every non-word character in every element of url_key, using Python.

Example of my code:

import re
sentences = ["Disallow DCCP sockets due to such NFC-3456",
            "Check at http://www.n.io/search?query=title++sub/file.html",
            "Specifies the hash algorithm on them"]

url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']    # there are thousands of elements
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']    # there are hundreds of elements

pattern = url_key + add_key
mykey = re.compile('(?:% s)' % '|'.join(pattern))

for item in sentences:
    if mykey.search(item):
        print (item, ' --> Keyword is found')
    else:
        print (item, ' --> Keyword is not Found')

But this code gives me an error:

error                             Traceback (most recent call last)
<ipython-input-80-5348ee9c65ec> in <module>()
      8 
      9 pattern = url_key + add_key
---> 10 mykey = re.compile('(?:% s)' % '|'.join(pattern))
     11 
     12 for item in sentences:

~/anaconda3/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/anaconda3/lib/python3.6/re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.6/sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

~/anaconda3/lib/python3.6/sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
    763                 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    764                                not (del_flags & SRE_FLAG_VERBOSE))
--> 765                 p = _parse_sub(source, state, sub_verbose, nested + 1)
    766             if not source.match(")"):
    767                 raise source.error("missing ), unterminated subpattern",

~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
    617             if item[0][0] in _REPEATCODES:
    618                 raise source.error("multiple repeat",
--> 619                                    source.tell() - here + len(this))
    620             if sourcematch("?"):
    621                 subpattern[-1] = (MIN_REPEAT, (min, max, item))

error: multiple repeat at position 31

Expected result:

Disallow DCCP sockets due to such NFC-3456 --> Keyword is found 
Check at http://www.n.io/search?query=title++sub/file.html --> Keyword is found
Specifies the hash algorithm on them --> Keyword is not found

Any help would be appreciated. Thanks.

Selcuk
YusufUMS
  • Can you explain why you need to do this? This sounds like [an XY problem](https://meta.stackexchange.com/q/66377/322040) you're trying to hack by hand, when an existing string escaping API would do the trick (without necessarily behaving exactly the way you describe). Also, saying "didn't work" isn't very helpful; a [MCVE] should show the observed output (including traceback if an exception was raised). – ShadowRanger Dec 05 '19 at 00:23
  • I agree with @ShadowRanger entirely. Be careful, take your time, don’t just jump on something because “it works”. – AMC Dec 05 '19 at 03:14

2 Answers


You should either use raw strings:

result = re.sub(r'(\W)', r'\\\1', mystring)

or escape backslashes too:

result = re.sub('(\\W)', '\\\\\\1', mystring)
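
As a quick check, the sketch below runs both spellings on the same input and compares the results. The value of mystring is an assumption here (the first URL key from the question), since the answer does not define it.

```python
import re

# Assumed sample input: the first URL key from the question.
mystring = 'www.n.io/search?query=title++sub'

# Raw-string spelling: the regex engine sees the template \\\1,
# i.e. a literal backslash followed by group 1.
raw_form = re.sub(r'(\W)', r'\\\1', mystring)

# Plain-string spelling: every backslash is doubled, so the regex
# engine receives exactly the same pattern and template.
escaped_form = re.sub('(\\W)', '\\\\\\1', mystring)

print(raw_form)
print(raw_form == escaped_form)
```

Both calls should put a backslash in front of every non-word character, so the comparison is expected to print True.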
Selcuk

Your main problem is that string escapes take effect before the regex substitution escapes. Switching to raw strings (to inhibit string escapes) and escaping the backslash (because a backslash is itself an escape character in the replacement template, so a literal one is written \\) will fix this:

>>> print(re.sub(r'(\W)', r'\\\1', '?:n.io/search?query=title++sub'))
\?\:n\.io\/search\?query\=title\+\+sub

Note that you may not need such extensive escaping. If you just want to escape regex special characters, re.escape will do this for you:

>>> print(re.escape('?:n.io/search?query=title++sub'))
\?:n\.io/search\?query=title\+\+sub

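without adding unnecessary escapes (ones that aren't needed to despecialize regex characters). That output is from Python 3.7 or later; earlier versions, including the 3.6 shown in the question's traceback, escape every non-alphanumeric character except the underscore, which is harmless but more verbose.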
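
Applied to the code in the question, this might look like the sketch below. It assumes the url_key entries are meant as literal text while the add_key entries are already regexes, so only the former go through re.escape. The results list is a name introduced here for illustration.

```python
import re

sentences = [
    "Disallow DCCP sockets due to such NFC-3456",
    "Check at http://www.n.io/search?query=title++sub/file.html",
    "Specifies the hash algorithm on them",
]

url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']  # literal text
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']                        # already regexes

# Escape only the literal keys; escaping add_key would break its quantifiers.
pattern = [re.escape(key) for key in url_key] + add_key
mykey = re.compile('(?:%s)' % '|'.join(pattern))

# Hypothetical helper list: each sentence paired with whether any key matched.
results = [(item, bool(mykey.search(item))) for item in sentences]

for item, found in results:
    print(item, '--> Keyword is found' if found else '--> Keyword is not found')
```

With the "++" escaped, the combined pattern compiles, and the first two sentences should be reported as found while the third is not.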

ShadowRanger