I'm trying to compute word frequencies in a UTF-8 encoded text file with the following code. It tokenizes the file contents and loops over the words, but the accented characters come out mangled.
import csv
import operator  # needed for operator.itemgetter below
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# `lang`, `path`, `file_name`, and `stem` are defined earlier in the script
print "computing word frequency..."
if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")

rb = csv.reader(open(path + file_name))
wb = csv.writer(open('results/all_words_' + file_name, 'wb'))
tokenizer = RegexpTokenizer(r'\w+')
word_dict = {}
i = 0
for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:
            try:
                word_dict[doc] += 1
            except:
                word_dict[doc] = 1
        except:
            print row[0]
    print " ".join(text)
word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
if lang == "English":  # note: `lang` is set to "en"/"fr" above, so this branch never runs
    for item in word_dict2:
        wb.writerow([item[0], stem(item[0]), item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0], item[1]])
print "Finished"
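As an aside, the nested try/except I use for counting can be written more simply with `dict.get` or the standard-library `collections.Counter`; a minimal sketch, independent of the NLTK pieces above:

```python
from collections import Counter

tokens = ["crepes", "dimanche", "crepes", "rt"]

# dict.get avoids the nested try/except: missing keys default to 0
word_dict = {}
for tok in tokens:
    word_dict[tok] = word_dict.get(tok, 0) + 1

# or, equivalently, with the standard-library Counter
word_counts = Counter(tokens)
```

Either form produces the same word-to-count mapping, and `Counter` also gives `most_common()` for free, so the explicit `sorted(...)` step could go away too.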
Input text file:

rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent

The output written to the results file mangles certain words.
Results output:
crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1
envoy� is envoyé in the actual file, and v� and lo appear to be two pieces of vélo from the input.
How can I correct this problem with accented characters?
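What I've found so far: I suspect the tokenizer is running on raw UTF-8 byte strings (Python 2 `str`), where `\w` only matches ASCII word characters, so a word like vélo gets split apart at the two bytes of é. A minimal sketch of that behaviour, shown here with Python 3 `bytes` to mimic Python 2 byte strings:

```python
import re

word = "vélo"
utf8_bytes = word.encode("utf-8")  # b'v\xc3\xa9lo'

# On byte strings, \w matches only ASCII word characters, so the two
# bytes of é break the word apart -- matching the v.../lo in my output.
print(re.findall(rb"\w+", utf8_bytes))  # [b'v', b'lo']

# On decoded (unicode) text, \w is Unicode-aware and the word survives.
print(re.findall(r"\w+", word))         # ['vélo']
```

Is the right fix then to decode each row with something like `row[0].decode("utf-8")` before tokenizing, and keep the stop words as unicode rather than encoding them, or is there a better approach?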