
I am using Python 2.7 for NLP on the Bodo language (which uses the Devanagari script).

For stop word removal, I made a list of stop words in a file, one word per line (separated by "\n"). I used the codecs module to read this file and convert it to a list.

import codecs

# Read the stop word list (one word per line) into a Python list.
raw_txt = codecs.open('stopwords.txt', 'r', 'utf-8')
stopWords = []
for line in raw_txt:
    line = line.strip()
    if line:
        stopWords.append(line)
raw_txt.close()

Then I compiled a regular expression to match the stop words:

import regex

# Wrap each stop word in \b anchors and join them into one alternation.
def addWordBoundary(word):
    return r"\b" + word + r"\b"

reg = regex.compile(u"(%s)" % u"|".join(map(addWordBoundary, stopWords)), regex.UNICODE)

I read the corpus (a text file) into a string using the codecs module, removed the matches with regex.sub(), and then wrote the result to a file, again using codecs. But some words were missed, and I could not figure out why.

# Read the whole corpus into a string, remove the stop words, and write
# the result back out, all via codecs.
fl = codecs.open('corpus.txt', 'r', 'utf-8')
rawFile = fl.read()
fl.close()

# The UNICODE flag was already set at compile time; note that the third
# positional argument of sub() is a replacement count, not a flags value.
cleanText = reg.sub(u'', rawFile)

wr = codecs.open('output.txt', 'w', 'utf-8')
wr.write(cleanText)
wr.close()

For testing purposes, use this as both stopwords.txt and corpus.txt:

माब्लानिफ्रायथो
फारसेनिफ्रायबो
रावनिफ्रायबो
माब्लानिफ्राय
जेब्लानिफ्राय
अब्लानिफ्राय
इफोरनिफ्राय
नोंनिफ्रायबो
फारसेनिफ्राय
नोंनिफ्रायनो

The output.txt file should be empty, but it contains:

रावनिफ्रायबो
इफोरनिफ्राय

This code works fine for English text (ASCII), so maybe I am doing something wrong with UTF-8 processing. Please suggest.

srajbr
  • I don't know why it fails, but when I tested it, there were even more entries in the output. All the missed stop words seem to contain a combining character (e.g. थो, which is थ + ◌ो). Perhaps the word boundary detector `\b` doesn't function correctly with combining characters. – jogojapan May 26 '13 at 10:40
  • @jogojapan I don't think the combining character is the issue, because many of them work, e.g. फ + र = फ्र, न + ो = नो. This may be a bug! – srajbr May 26 '13 at 13:04
  • duplicate of http://stackoverflow.com/questions/16579113/regular-expression-doesnt-work-properly-with-turkish-characters – prash May 29 '13 at 09:50

1 Answer


Some of the stop words in the file you provided start or end with characters that are not defined as alphanumeric:

import unicodedata as ud

# Print the Unicode category of the first and last character of each stop word.
for w in stopWords:
    for c in w[0], w[-1]:
        print repr(c), ud.category(c),
    print
u'\u092e' Lo u'\u094b' Mc
u'\u092b' Lo u'\u094b' Mc
u'\ufeff' Cf u'\u094b' Mc
u'\u092e' Lo u'\u092f' Lo
u'\u091c' Lo u'\u092f' Lo
u'\u0905' Lo u'\u092f' Lo
u'\ufeff' Cf u'\u092f' Lo
u'\u0928' Lo u'\u094b' Mc
u'\u092b' Lo u'\u092f' Lo
u'\u0928' Lo u'\u094b' Mc

In particular, two lines – the ones you are seeing in output.txt – start with u'\ufeff':

ud.name(u'\ufeff') == 'ZERO WIDTH NO-BREAK SPACE'

This is also known as a byte order mark (BOM) and is sometimes used at the start of a file to identify the encoding. Here, it has probably been included in the file accidentally while editing it. Python does appear to remove the character if it is at the very start of the file, but not when it appears elsewhere in the file, and strip() does not remove it either, because Python does not treat U+FEFF as whitespace. The simplest fix is to remove these characters from the input file manually, or to strip them while reading, as in the sketch below.
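If you would rather handle it in code, here is a minimal sketch (reusing the stopwords.txt file from your question) that drops format characters such as U+FEFF while loading the list:

import codecs
import unicodedata as ud

# Load the stop words, dropping any "Cf" (format) characters such as the
# BOM (U+FEFF) that may have crept into the file while editing it.
stopWords = []
with codecs.open('stopwords.txt', 'r', 'utf-8') as raw_txt:
    for line in raw_txt:
        word = u''.join(c for c in line.strip() if ud.category(c) != 'Cf')
        if word:
            stopWords.append(word)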

I am also getting the ones ending in u'\u094b' (DEVANAGARI VOWEL SIGN O) in the output, so my copy of Python apparently does not treat these as alphanumeric characters.
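If `\b` also turns out to be unreliable for these vowel signs on your Python version, one alternative sketch (my own suggestion, not part of your original approach) avoids word boundaries entirely by splitting the corpus on whitespace and filtering tokens against a set. It assumes the stop words are whole, whitespace-separated tokens, as in your test files:

import codecs

# Split the corpus on whitespace and drop tokens that are stop words.
# Invisible format characters (such as the BOM) would still have to be
# stripped from both the corpus and the stop word list for this to match.
stopSet = set(stopWords)
with codecs.open('corpus.txt', 'r', 'utf-8') as fl:
    tokens = fl.read().split()

cleanText = u' '.join(t for t in tokens if t not in stopSet)

with codecs.open('output.txt', 'w', 'utf-8') as wr:
    wr.write(cleanText)

Note that this sketch does not preserve the original line breaks of the corpus; it is meant only to show the idea.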

Also, in general, when you want to match exact strings in a regular expression, you should pass each string through re.escape() (or regex.escape() with the regex module) before inserting it into the pattern, in case the string contains characters that would be treated as regular expression metacharacters.
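For example, a hedged sketch of your addWordBoundary() helper with escaping added:

import regex

# Escape each stop word so regex metacharacters in it are matched literally.
def addWordBoundary(word):
    return r"\b" + regex.escape(word) + r"\b"

reg = regex.compile(u"(%s)" % u"|".join(map(addWordBoundary, stopWords)), regex.UNICODE)

Escaping makes no difference for these Devanagari words, but it keeps the pattern safe if a stop word ever contains characters like ".", "(" or "?".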

Stian Ellingsen