Using re.findall(), I'm attempting to find all occurrences of each term from a list of terms in a string. If a particular term contains special characters (e.g. a '+'), a match will not be found, or error messages may be generated. Using re.escape(), the error messages are avoided, but the terms with special characters are still not found within the string.
import re

my_list = ['java', 'c++', 'c#', '.net']
my_string = ' python javascript c++ c++ c# .net java .net'
matches = []

for term in my_list:
    if any(x in term for x in ['+', '#', '.']):
        term = re.escape(term)
    print("\nlooking for term '%s'" % term)
    match = re.findall("\\b" + term + "\\b", my_string, flags=re.IGNORECASE)
    matches.append(match)
The above code only finds 'java' within the string. Any suggestions on how to find terms with special characters within the string?
Caveat: I cannot change 'my_list' manually, because I don't know in advance what terms it will contain.
Update - it appears that the problem lies with the word boundary specifiers in the regex (the "\b"). A "\b" only matches at a position between a word character and a non-word character, so it fails next to the non-alphanumeric characters at the start or end of terms such as 'c++' or '.net'. It's unclear how to solve this in a clean and straightforward way, however.
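To illustrate the boundary issue, here is a minimal sketch that compares "\b" with one possible alternative: negative lookarounds that only require the term not to be adjacent to a word character. The helper name find_terms is hypothetical, and the lookaround approach assumes that "whole term" means "not directly preceded or followed by a letter, digit, or underscore"; it has only been checked against the sample string from this question.

```python
import re

my_list = ['java', 'c++', 'c#', '.net']
my_string = ' python javascript c++ c++ c# .net java .net'


def find_terms(terms, text):
    """Hypothetical helper: map each term to its whole-term matches in text.

    Uses lookarounds in place of word boundaries, so a term may start or
    end with a non-word character such as '+', '#' or '.'.
    """
    results = {}
    for term in terms:
        # (?<!\w): no word character immediately before the term
        # (?!\w):  no word character immediately after the term
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        results[term] = re.findall(pattern, text, flags=re.IGNORECASE)
    return results


# With \b, the trailing boundary sits between '+' and ' ', which are both
# non-word characters, so no boundary exists there and nothing matches.
boundary_hits = re.findall(r"\b" + re.escape("c++") + r"\b", my_string)

lookaround_hits = find_terms(my_list, my_string)

print(boundary_hits)            # []
print(lookaround_hits['java'])  # ['java']  ('javascript' is skipped)
print(lookaround_hits['c++'])   # ['c++', 'c++']
print(lookaround_hits['c#'])    # ['c#']
print(lookaround_hits['.net'])  # ['.net', '.net']
```

Since re.escape() is applied to every term, no manual check for special characters is needed, and 'my_list' does not have to be known in advance.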
Edit - this question is not a duplicate of this - it already incorporates the most applicable solution from that post.