I have the following two functions, which work perfectly fine with ASCII strings and use the re module:
    import re

    def findWord(w):
        return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall

    def replace_keyword(w, c, x):
        return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)
However, they fail on UTF-8 encoded strings with accented characters. On searching further, I found that the regex module is better suited for Unicode strings, so I have been trying to port this code to regex for the last couple of hours, but nothing seems to be working. This is what I have as of now:
    import regex

    def findWord(w):
        return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall

    def replace_keyword(w, c, x):
        return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)
However, when I use an accented (not normalized) UTF-8 encoded string, I keep getting an "ordinal not in range" error.
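For reference, here is a minimal sketch of where that message typically comes from, assuming the inputs are UTF-8 byte strings rather than unicode objects (an assumption, since the traceback is not shown). Python 2 implicitly decodes byte strings as ASCII when they are mixed with unicode objects, which behaves like the explicit decode below. The sample text and variable names are illustrative only.

```python
# Illustrative only: the sample text is not from the original script.
raw = u"caf\u00e9 au lait".encode("utf-8")  # UTF-8 bytes containing an accented character

try:
    # Roughly what Python 2 does implicitly when a byte string meets a unicode object
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding explicitly with the correct codec succeeds
text = raw.decode("utf-8")
```

If this is the cause, decoding the input to unicode before it reaches the regular expression functions may avoid the error.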
EDIT: The suggested possible duplicate, "Regular expression to match non-English characters?", doesn't solve my problem. First, I want to use the Python re/regex module. Secondly, I want to get the find and replace functions working in Python.
EDIT: I am using Python 2.
EDIT: If you can help me get these two functions working in Python 3 instead, please let me know. I hope to be able to invoke Python 3 from my Python 2 script just for these two functions.
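To make the Python 3 option concrete, here is a rough sketch of what the two functions might look like there, where str is Unicode by default and the standard re module treats accented letters as word characters. The re.escape call is an addition not present in the code above; it assumes the keyword should be matched literally rather than as a pattern.

```python
import re

def findWord(w):
    # re.escape (an addition) keeps characters in w from being read as regex syntax
    return re.compile(r'\b{0}.*?\b'.format(re.escape(w)), flags=re.IGNORECASE).findall

def replace_keyword(w, c, x):
    # Wrap every word starting with w in a <mark> tag coloured c
    pattern = r'\b({0}\S*)'.format(re.escape(w))
    replacement = r'<mark style="background-color:{0}">\1</mark>'.format(c)
    return re.sub(pattern, replacement, x, flags=re.IGNORECASE)
```

For example, findWord("caf")("Un Café au lait") returns ['Café'], and replace_keyword("caf", "yellow", "Un Café au lait") wraps "Café" in the mark tag.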