How to prevent an nltk corpus from reading extended ascii as unicode

Question

I'm loading a text file containing a plain-text version of the Afrikaans Wikipedia as an NLTK corpus, using the following code:

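from __future__ import division   # __future__ imports must be the first statement in the file.
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader
# Load every non-hidden .txt file in the 'afwikipedia' corpus directory.
afwikipedia = LazyCorpusLoader('afwikipedia', PlaintextCorpusReader, r'(?!\.).*\.txt')
af = nltk.Text(afwikipedia.words())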

I then look at the top words using the following:

from nltk.probability import FreqDist
fdist = FreqDist(af)
vocabulary = fdist.keys()
vocabulary[:250]   # 250 most frequently used words.

Unfortunately, this method has several problems. "'n" is a very common word in Afrikaans and means the same as "a" in English, but the method above splits it into two tokens, "'" and "n". Also, all extended ASCII characters seem to be treated as Unicode instead of ASCII, so "verpleër" becomes "verple\xc3r".

Does anyone know how I would go about fixing this? Especially the unicode treatment of an ascii character is really annoying.

I have also done the following:

# Create a file called sitecustomize.py in c:\python24\Lib\site-packages.
import sys
sys.setdefaultencoding('iso-8859-1')   # ASCII latin.

score 0 · Accepted Answer · answered Feb 03 '13 at 22:18

That's not Unicode; it's ASCII with 8-bit characters mixed in. PlaintextCorpusReader takes an encoding argument, which you can use to solve your problem.
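As a rough sketch of the encoding argument in use: the code below writes a tiny made-up corpus file to a temporary directory and reads it back with an explicit encoding. The sample text, the file name sample.txt, and the helper names read_words and demo are invented for illustration. Which encoding to pass depends on how your file was actually saved. The \xc3 byte in the question is the first byte of "ë" when it is encoded as UTF-8, so 'utf-8' may be worth trying before 'iso-8859-1', but that is a guess about your data.

```python
import os
import tempfile

# The word from the question as raw bytes in two candidate encodings.
WORD = u"verple\u00ebr"                  # "verpleër"
AS_LATIN1 = WORD.encode("iso-8859-1")    # b'verple\xebr'
AS_UTF8 = WORD.encode("utf-8")           # b'verple\xc3\xabr'


def read_words(root, encoding):
    """Tokenize every .txt file under `root`, decoding it with `encoding`."""
    # Imported lazily so the byte-level lines above run without NLTK installed.
    from nltk.corpus import PlaintextCorpusReader
    reader = PlaintextCorpusReader(root, r"(?!\.).*\.txt", encoding=encoding)
    return list(reader.words())


def demo():
    """Write a small UTF-8 sample file (made-up content) and read it back."""
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "sample.txt"), "wb") as handle:
        handle.write(b"die " + AS_UTF8 + b" werk\n")
    return read_words(root, "utf-8")
```

Calling demo() should return the word as a single decoded token rather than as fragments split around raw bytes. If the file had been written with AS_LATIN1 instead, the matching call would be read_words(root, "iso-8859-1").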

As for breaking up the ' from the n, that's a matter for the tokenizer. Find a tokenizer that works to your satisfaction and tell your corpus reader to use it through its word_tokenizer argument.
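One possible way to do this, sketched under assumptions: NLTK's RegexpTokenizer with a pattern that tries the article "'n" before falling back to runs of word characters or punctuation. The pattern, the names AFRIKAANS_TOKEN, tokenize and build_reader, and the default encoding are illustrative choices, not something the answer prescribes. The tokenize helper applies the same pattern with the standard re module, so you can try it without NLTK.

```python
import re

# Hypothetical pattern: match the article "'n" as one token when it stands
# alone, otherwise a run of word characters or a run of punctuation.
AFRIKAANS_TOKEN = r"'n\b|\w+|[^\w\s]+"


def tokenize(text):
    """Split `text` with the pattern above, using only the standard library."""
    return re.findall(AFRIKAANS_TOKEN, text)


def build_reader(root, encoding="utf-8"):
    """Corpus reader that tokenizes with the same pattern (requires NLTK)."""
    from nltk.corpus import PlaintextCorpusReader
    from nltk.tokenize import RegexpTokenizer
    return PlaintextCorpusReader(
        root,
        r"(?!\.).*\.txt",
        word_tokenizer=RegexpTokenizer(AFRIKAANS_TOKEN),
        encoding=encoding,
    )


print(tokenize("Dit is 'n boek."))   # ['Dit', 'is', "'n", 'boek', '.']
```

The \b after 'n keeps an opening quotation mark in front of a word that starts with "n" from being swallowed into the token. A pattern like this will still need tuning against real text, for example for typographic apostrophes.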
