
I'm using re.findall() to find every occurrence, in a string, of each term from a list of terms.

If a term contains special characters (e.g. a '+'), either no match is found or an error is raised. Using re.escape() avoids the errors, but the terms with special characters are still not found in the string.

import re
my_list = ['java', 'c++', 'c#', '.net']
my_string = ' python javascript c++ c++ c# .net java .net'
matches = []

for term in my_list:
    if any(x in term for x in ['+', '#', '.']):
        term = re.escape(term)

    print("\nlooking for term '%s'" % term)
    match = re.findall("\\b" + term + "\\b", my_string, flags=re.IGNORECASE)
    matches.append(match)

The above code finds only 'java' in the string. Any suggestions on how to find terms that contain special characters?

Caveat: I cannot change 'my_list' manually, because I don't know in advance what terms it will contain.

Update - it appears that the problem lies with the word-boundary specifiers in the regex (the "\b"). A \b matches only between a word character (alphanumeric or underscore) and a non-word character, so it fails when a term begins or ends with a non-alphanumeric character that sits next to whitespace. It's unclear how to solve this in a clean and straightforward way, however.
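To make the boundary behaviour concrete, here is a minimal sketch that wraps every term in \b after escaping it unconditionally. The helper name `boundary_matches` is made up for illustration; it is not part of the code above.

```python
import re

def boundary_matches(term, text):
    """Return all matches of the escaped term wrapped in word-boundary anchors."""
    return re.findall(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)

text = " python javascript c++ c++ c# .net java .net"

for term in ["java", "c++", "c#", ".net"]:
    # Only 'java' is found: it both starts and ends with a word character.
    print(term, boundary_matches(term, text))

# A boundary does exist after '+' when a word character follows it,
# so the same pattern matches inside a longer token such as 'c++y'.
print(boundary_matches("c++", "c++y"))
```

The last call returns `['c++']`, which suggests that \b is the wrong tool here in both directions: it misses 'c++' followed by a space and accepts it when glued to a letter.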

Edit - this question is not a duplicate of this - it already incorporates the most applicable solution from that post.

Boa
  • Escape them by prepending a backslash `\+` – Torxed May 14 '15 at 16:46
  • + means something very specific in regex ... you could try escaping it `C\+\+` – Joran Beasley May 14 '15 at 16:46
  • In `'c++'` the word boundary occurs between `c` and `+`. In your regex you are putting the boundary after the last `+`. – Steven Rumbalski May 14 '15 at 16:48
  • @JoranBeasley - that's essentially what re.escape() does, isn't it? As stated, re.escape() takes care of the error message, but doesn't help to produce matches for strings that contain characters that are not alphanumeric. I'll clarify that in the post. – Boa May 14 '15 at 16:51
  • @StevenRumbalski - yes, but 'term = re.escape(term)' mutates the term which is being passed to re.findall() at the given iteration. – Boa May 14 '15 at 16:53
  • possible duplicate of [How to escape special characters of a string with single backslashes](http://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes) – Torxed May 14 '15 at 16:54
  • Your problem is not the escape characters but the \\b you have around the term. If you take them out, you will get matches. I understand you want to allow only full-word matches, but the escaped terms are not your issue. Looking into it more. – Rob May 14 '15 at 16:57
  • Steven Rumbalski is correct: the problem is the \\b next to a non-word (\W) character. A \\b matches only where a word character meets a non-word character, so you cannot really do this nicely. You would be better off tokenizing the string and using your pattern on each token without the \\b in it. – Rob May 14 '15 at 17:09
  • @Rob - thanks, Rob. Indeed, it seems that the problem has to do with how the \\b interacts with non-alphanumerics. Kind of hard to conceive that there's no cleaner solution to this than tokenizing the string, though. – Boa May 14 '15 at 17:19
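
The tokenizing idea from the comments could look roughly like the sketch below. It assumes terms are separated by whitespace only, and the function name `find_terms_by_token` is hypothetical.

```python
def find_terms_by_token(terms, text):
    """Map each term to the whitespace-separated tokens that equal it (case-insensitive)."""
    tokens = text.split()  # split on any run of whitespace
    return {
        term: [tok for tok in tokens if tok.lower() == term.lower()]
        for term in terms
    }

my_list = ["java", "c++", "c#", ".net"]
my_string = " python javascript c++ c++ c# .net java .net"
print(find_terms_by_token(my_list, my_string))
```

No regex is involved, so nothing needs escaping. The trade-off is that a token with attached punctuation, such as 'C++,', is not counted as a match for 'c++'.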

1 Answer

import re
my_list = ['java', 'c++', 'c#', '.net']
my_string = ' python javascript c++ c++ c# .net java .net'
matches = []

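for term in my_list:
    # Escape every term: re.escape() leaves plain terms usable and also
    # covers metacharacters beyond '+', '#' and '.' (the list is not known in advance).
    pattern = re.escape(term)

    print("\nlooking for term '%s'" % term)
    # Require whitespace or a string edge on both sides instead of \b.
    match = re.findall(r"(?:^|(?<=\s))" + pattern + r"(?=\s|$)", my_string, flags=re.IGNORECASE)
    matches.append(match)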

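Try this. The problem is \b, the word-boundary anchor: it matches only between a word character and a non-word character. In 'c++' the final + is followed by a space, and both are non-word characters, so there is no word boundary after the + and the pattern does not match. The same applies to the '#' in 'c#' and the leading '.' in '.net'. The lookarounds above instead require each term to be preceded by whitespace or the start of the string, and followed by whitespace or the end of the string.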
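
As a reusable variant, here is a sketch that wraps the same idea in a function. It swaps in the negative lookarounds `(?<!\S)` and `(?!\S)`, which should behave the same as the pattern above (not preceded and not followed by a non-whitespace character). The function name `find_terms` is an assumption for illustration.

```python
import re

def find_terms(terms, text):
    """Map each term to its whitespace-delimited occurrences in text (case-insensitive)."""
    results = {}
    for term in terms:
        # re.escape() neutralises regex metacharacters such as '+' and '.'.
        pattern = r"(?<!\S)" + re.escape(term) + r"(?!\S)"
        results[term] = re.findall(pattern, text, flags=re.IGNORECASE)
    return results

my_list = ["java", "c++", "c#", ".net"]
my_string = " python javascript c++ c++ c# .net java .net"
print(find_terms(my_list, my_string))
```

With the sample string this finds 'java' once, 'c++' twice, 'c#' once and '.net' twice, and still skips the 'java' inside 'javascript'.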

vks