I am using Python 2.7 for NLP in the Bodo language (which uses the Devanagari script).
For stop word removal, I put the stop words in a file, one per line (separated by "\n"). I use the codecs module to read this file into a list:
import codecs

raw_txt = codecs.open('stopwords.txt', 'r', 'utf-8')
stopWords = []
while True:
    line = raw_txt.readline()
    if not line:
        break
    # strip() removes the trailing newline and any surrounding whitespace
    stopWords.append(line.strip())
raw_txt.close()
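The same loading step can be written more compactly. The following is a minimal sketch, not a drop-in replacement: the helper name load_stop_words is hypothetical, it uses io.open (available in Python 2.7 and 3) in place of codecs.open, it decodes with 'utf-8-sig' so that a leading byte order mark, if the file has one, does not end up glued to the first stop word, and it skips blank lines so that no empty entry reaches the pattern later.

```python
# -*- coding: utf-8 -*-
import io


def load_stop_words(path):
    """Return the non-empty, stripped lines of a UTF-8 text file as a list."""
    # 'utf-8-sig' decodes plain UTF-8 and also drops a leading BOM if present.
    with io.open(path, 'r', encoding='utf-8-sig') as handle:
        return [line.strip() for line in handle if line.strip()]
```

It would be called as stopWords = load_stop_words('stopwords.txt').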
Then I compile a regular expression that matches any of the stop words:
import regex

def addWordBoundary(word):
    return r"\b" + word + r"\b"

reg = regex.compile(r"(%s)" % "|".join(map(addWordBoundary, stopWords)), regex.UNICODE)
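One variation that may help narrow the problem down is to escape each stop word and replace \b with explicit whitespace lookarounds, so the result no longer depends on how the engine classifies Devanagari vowel signs and the virama. This is only a sketch under assumptions: it uses the standard library re module rather than regex to stay self-contained, the name build_stop_word_pattern is hypothetical, and it assumes stop words in the corpus are delimited by whitespace, not punctuation.

```python
# -*- coding: utf-8 -*-
import re


def build_stop_word_pattern(stop_words):
    """Compile a pattern matching any stop word that stands alone between whitespace."""
    # Longest first, so a word is tried before any of its own prefixes.
    ordered = sorted(set(stop_words), key=len, reverse=True)
    alternatives = u"|".join(re.escape(word) for word in ordered)
    # (?<!\S) and (?!\S): no non-whitespace character directly before or after.
    return re.compile(u"(?<!\\S)(?:%s)(?!\\S)" % alternatives, re.UNICODE)


# Four words from the test data, used as both stop words and corpus.
sample = u"माब्लानिफ्रायथो\nमाब्लानिफ्राय\nइफोरनिफ्राय\nरावनिफ्रायबो"
pattern = build_stop_word_pattern(sample.split())
print(repr(pattern.sub(u"", sample).strip()))
```

With this sample, every word is removed and only the newlines remain, which strip() then drops.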
I read the corpus (a text file) into a string with the codecs module, run reg.sub() on it, and write the result to a file, again with codecs. However, some stop words are not removed, and I cannot figure out why.
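fl = codecs.open('corpus.txt', 'r', 'utf-8')
rawFile = fl.read()
fl.close()
# The flags are already set on the compiled pattern; a third positional
# argument to sub() would be interpreted as count, not as flags.
cleanText = reg.sub('', rawFile)
wr = codecs.open('output.txt', 'w', 'utf-8')
wr.write(cleanText)
wr.close()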
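As a cross-check for the regex result, the same removal can be done without any regular expression by splitting on whitespace and filtering against a set. A minimal sketch, assuming tokens are separated by whitespace only (punctuation attached to a word would make it miss that word); the function name is hypothetical.

```python
# -*- coding: utf-8 -*-
def remove_stop_words_by_token(text, stop_words):
    """Drop every whitespace-delimited token that is in stop_words."""
    stop_set = set(stop_words)
    cleaned_lines = []
    for line in text.splitlines():
        # Keep only the tokens that are not stop words.
        kept = [token for token in line.split() if token not in stop_set]
        cleaned_lines.append(u" ".join(kept))
    return u"\n".join(cleaned_lines)
```

If this version empties the test file while the regex version does not, that would point to the pattern or the regex engine rather than to the decoding of the files. Note that it collapses runs of spaces within a line.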
For testing, use the following as both stopwords.txt and corpus.txt:
माब्लानिफ्रायथो
फारसेनिफ्रायबो
रावनिफ्रायबो
माब्लानिफ्राय
जेब्लानिफ्राय
अब्लानिफ्राय
इफोरनिफ्राय
नोंनिफ्रायबो
फारसेनिफ्राय
नोंनिफ्रायनो
The output.txt file should be empty, but it contains:
रावनिफ्रायबो
इफोरनिफ्राय
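Since only two of the ten words survive, it may help to look at them code point by code point, in case the copy in stopwords.txt and the copy in corpus.txt differ by something invisible (a zero-width joiner, a different normalization form). A minimal diagnostic sketch using the standard unicodedata module; the helper name describe is hypothetical.

```python
# -*- coding: utf-8 -*-
import unicodedata


def describe(word):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (u"U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, u"<unnamed>"))
        for ch in word
    ]


# One of the two words that were left in output.txt.
for entry in describe(u"इफोरनिफ्राय"):
    print(entry)
```

Running it on the same word as read from each file and comparing the two lists shows whether the strings are really identical.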
This code works fine for English (ASCII) text, so maybe I am doing something wrong with the UTF-8 processing. Any suggestions?