I'm loading a text file containing a plain-text version of the Afrikaans Wikipedia as an NLTK corpus, using the following code:
from __future__ import division  # __future__ imports must come first in a file
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader
afwikipedia = LazyCorpusLoader('afwikipedia', PlaintextCorpusReader, r'(?!\.).*\.txt')
af = nltk.Text(afwikipedia.words())
I then look at the top words using the following:
from nltk.probability import FreqDist
fdist = FreqDist(af)
vocabulary = fdist.keys()
vocabulary[:250] # 250 most frequently used words (keys() is frequency-sorted in older NLTK; NLTK 3 offers fdist.most_common(250)).
Unfortunately, this method has several problems. "'n" is a very common word in Afrikaans, meaning the same as "a" in English, but the method above splits it into two tokens, "'" and "n". Also, all extended ASCII characters seem to be treated as Unicode instead of ASCII, so "verpleër" becomes "verple\xc3r".
Does anyone know how I would go about fixing this? The Unicode treatment of an ASCII character is especially annoying.
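To make the "'n" problem reproducible without the corpus, here is a minimal sketch using only the `re` module. It assumes the default word tokenizer behaves roughly like the pattern `\w+|[^\w\s]+` (which, as far as I can tell, is what NLTK's WordPunctTokenizer uses). The sample sentence is made up, and the second pattern is only a guess at a workaround, not a tested fix for the real corpus.

```python
import re

# Hypothetical sample sentence, not taken from the real corpus.
sample = u"Die verple\u00ebr het 'n boek gelees."

# Assumed default behaviour: runs of word characters, or runs of punctuation.
naive = re.findall(r"\w+|[^\w\s]+", sample, re.UNICODE)

# Same pattern, but "'n" is tried first so the article stays a single token.
keep_article = re.findall(r"'n\b|\w+|[^\w\s]+", sample, re.UNICODE)

print(naive)
print(keep_article)
```

If this is indeed the cause, a tokenizer built on the second pattern could presumably be passed to PlaintextCorpusReader through its `word_tokenizer` argument.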
I have also done the following:
# Create a file called sitecustomize.py in c:\python24\Lib\site-packages.
import sys
sys.setdefaultencoding('iso-8859-1') # ISO-8859-1 (Latin-1).
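A small sketch of why the choice of encoding might matter here, assuming the corpus file is actually stored as UTF-8 (the \xc3 byte in the output suggests this, but I have not verified it). The byte string below is a made-up example, not read from the real file.

```python
# Hypothetical raw bytes for "verpleër" as they would appear in a UTF-8 file.
raw = b"verple\xc3\xabr"

# Decoding as UTF-8 turns the two bytes 0xC3 0xAB into the single letter "ë".
as_utf8 = raw.decode("utf-8")

# Decoding as ISO-8859-1 maps each byte to its own character instead,
# so the word gains one extra character.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))
```

If that assumption holds, setting the default encoding to ISO-8859-1 would not recover the "ë", whereas reading the file as UTF-8 (for example through the `encoding` argument of PlaintextCorpusReader) might.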