Processing accented Unicode characters with python regex module

Question

I have the following two functions, which work perfectly fine with ASCII strings and use the re module:

import re

def findWord(w):
    return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall


def replace_keyword(w, c, x):
    return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)

However, they fail on UTF-8 encoded strings with accented characters. On searching further, I found that the regex module is better suited to Unicode strings, so I have been trying to port this to regex for the last couple of hours, but nothing seems to be working. This is what I have as of now:

import regex

def findWord(w):
    return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall

def replace_keyword(w, c, x):
    return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)

However, when I pass in an accented (not normalized) UTF-8 encoded string, I keep getting an "ordinal not in range" error.

EDIT: The suggested possible duplicate question: Regular expression to match non-English characters? doesn't solve my problem. I want to use the python re/regex module. Secondly, I want to get the find and replace functions working using python.

EDIT: I am using python 2

EDIT: If you feel you can help me get these two functions working using Python 3, please let me know. I hope I will be able to invoke Python 3 from my Python 2 script just to run these two functions.

"they fail on using the utf-8 encoded strings" Yes, yes they do. This is to be expected since they work on text and UTF-8 encoded strings aren't text. — Ignacio Vazquez-Abrams, Aug 03 '15 at 03:06
possible duplicate of [Regular expression to match non-English characters?](http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters) — Izzy, Aug 03 '15 at 03:07
Are you using Python 2 or 3? What do you mean by "UTF-8 encoded string"? In Python 2, strings are ASCII-only, in Python 3 strings allow any Unicode codepoint. Encodings like UTF-8 are relevant when reading in or writing out text, inside Python a string doesn't have an encoding, per se. — dimo414, Aug 03 '15 at 03:20
@dimo414: Thanks for this info "inside Python a string doesn't have an encoding, per se". In short, I have accented characters present in my string and I want to get these two functions (find and replace) working for them in python 2 — The Wanderer, Aug 03 '15 at 03:22
@TheWanderer: You need to operate on Unicode strings and enable re.UNICODE to make the tokens `\b`, `\w`, `\d`, `\s` work with Unicode characters. — nhahtdh, Aug 03 '15 at 03:57
@nhahtdh: I am not sure if I understand correctly. Will I be able to preserve the accents ? — The Wanderer, Aug 03 '15 at 12:44
This is a great question, but if you are dealing with non-ASCII text I highly recommend that you move to Python 3. — dotancohen, Aug 03 '15 at 12:56
@dotancohen: Can you please try to get this working using Python 3? I am assuming I will be able to invoke python 3 script from my python 2 which will just run these two functions. — The Wanderer, Aug 03 '15 at 14:04
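
As a rough illustration of what the comments above suggest (operate on Unicode strings and enable re.UNICODE), here is a minimal sketch of the two functions. It assumes Python 3, where str is already Unicode; in Python 2 the inputs would first need to be decoded into unicode objects, for example with `.decode('utf-8')`. The snake_case names and the `re.escape` call are additions for illustration and are not part of the original code.

```python
import re


def find_word(w):
    # re.escape is an added safeguard so that regex metacharacters in the
    # keyword are matched literally; the rest of the pattern is the original.
    pattern = r'\b{0}.*?\b'.format(re.escape(w))
    # re.UNICODE is already the default for str patterns in Python 3;
    # it is spelled out here only to mirror the comment thread.
    return re.compile(pattern, flags=re.IGNORECASE | re.UNICODE).findall


def replace_keyword(w, c, x):
    pattern = r'\b({0}\S*)'.format(re.escape(w))
    replacement = r'<mark style="background-color:{0}">\1</mark>'.format(c)
    return re.sub(pattern, replacement, x, flags=re.IGNORECASE | re.UNICODE)


matches = find_word('caf')('Un café, un CAFÉ et un thé')
highlighted = replace_keyword('élé', 'yellow', 'Un Éléphant gris')
```

Because the match is taken from the input text itself, the accents and the original casing are preserved in both the found words and the highlighted output.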

score 0 · Answer 1 · answered Aug 03 '15 at 15:06

I think I am getting somewhere. I am trying to get this working in plain Python, without the re or regex modules:

found_keywords = []
for word in keyword_list:
    if word.lower() in article_text.lower():
        found_keywords.append(word)

for word in found_keywords:  # highlight the found keyword in the text
    article_text = article_text.lower().replace(word.lower(), '<mark style="background-color:%s">%s</mark>' % (yellow_color, word))

Now, I just have to somehow replace found keywords in a case-insensitive manner and I will be good to go.

The code above lowercases the whole article as a side effect, so what I still need help with is this last step: replacing keywords case-insensitively, without re or regex, in a way that works for accented strings.
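
One possible way to do that last step, sketched under assumptions: search in a lowercased copy of the text, but cut the pieces out of the original, so that casing and accents survive. This assumes Python 3 (or unicode objects in Python 2) and that lowercasing does not change the length of the text, which holds for most but not all characters. The function name `highlight_keyword` is hypothetical.

```python
def highlight_keyword(text, word, color):
    """Wrap each case-insensitive occurrence of word in a <mark> tag,
    keeping the casing of the original text."""
    lowered_text = text.lower()
    lowered_word = word.lower()
    # Assumption: lowercasing keeps the length unchanged, so indexes found
    # in lowered_text are also valid in text. If that does not hold (or the
    # keyword is empty), this sketch gives up and returns the text as-is.
    if not lowered_word or len(lowered_text) != len(text):
        return text
    pieces = []
    pos = 0
    while True:
        hit = lowered_text.find(lowered_word, pos)
        if hit == -1:
            break
        end = hit + len(lowered_word)
        pieces.append(text[pos:hit])  # untouched text before the match
        pieces.append('<mark style="background-color:%s">%s</mark>'
                      % (color, text[hit:end]))  # match with original casing
        pos = end
    pieces.append(text[pos:])  # remainder after the last match
    return ''.join(pieces)


result = highlight_keyword('Le Café est un café', 'café', 'yellow')
```

When this is called in a loop over several keywords, a later keyword could also match inside markup inserted by an earlier pass (for example the word "mark" or "style"), so it may be safer to collect all match positions first and build the output once.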
