Handling Character Encoding Problems in Python using NLTK

Question

I have downloaded and cleaned up a set of RSS feeds to use as a corpus with NLTK for testing classification. But when I run the frequency distribution, many of the top results seem to be special characters:

<FreqDist: '\x92': 494, '\x93': 300, '\x97': 159, ',\x94': 134, 'company': 124, '.\x94': 88, 'app': 84, 'Twitter': 82, 'people': 76, 'time': 73, ...>

I tried the suggestion in the question here and initialized the corpus thusly (specifying the encoding):

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

my_corpus = CategorizedPlaintextCorpusReader('C:\\rss_feeds', r'.*/.*', cat_pattern=r'(.*)/.*', encoding='iso-8859-1')
print len(my_corpus.categories())
myfreq_dist = make_training_data(my_corpus)

but it only changed the results to:

<FreqDist: u'\x92': 494, u'\x93': 300, u'\x97': 159, u',\x94': 134, u'company': 124, u'.\x94': 88, u'app': 84, u'Twitter': 82, u'people': 76, u'time': 73, ...>

The encoding of the Python source file is also set:

# -*- coding: iso-8859-1 -*-

For completeness, I use the following code to manipulate the Corpus Reader into training data:

import nltk
from nltk import FreqDist

def make_training_data(rdr):
    all_freq_dist = []
    # take the union of all stopwords and punctuation marks
    punctuation = set(['.', '?', '!', ',', '$', ':', ';', '(',')','-',"`",'\'','"','>>','|','."',',"'])
    full_stop_set = set(nltk.corpus.stopwords.words('english')) | punctuation
    for c in rdr.categories():
        all_category_words = []
        for f in rdr.fileids(c):
            # try to remove stop words and punctuation
            filtered_file_words = [w for w in rdr.words(fileids=[f]) if w.lower() not in full_stop_set]
            # add the words from each file to the list of words for the category
            all_category_words.extend(filtered_file_words)
        # pair each category's frequency distribution with the category name
        list_cat_fd = FreqDist(all_category_words), c
        print list_cat_fd
        all_freq_dist.append(list_cat_fd)
    return all_freq_dist

When I open the files themselves in Notepad++ it says that they are encoded in ANSI.

Ideally I would like to remove special characters and punctuation from the word list before generating the frequency distribution. Any help would be greatly appreciated.

it might not be special characters, it might be accented characters. see http://stackoverflow.com/questions/3328995/how-to-remove-xe2-from-a-list — alvas, Sep 26 '13 at 15:02
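A note on what these bytes are likely to be: Notepad++ uses "ANSI" for the system default code page, which on Western-locale Windows is usually Windows-1252 (cp1252). That is an assumption about the asker's machine, not something the question confirms. Under that assumption, the short sketch below shows why decoding as iso-8859-1 leaves u'\x92' and friends in the output: ISO-8859-1 maps bytes 0x80 to 0x9F onto invisible control characters, while cp1252 maps the same bytes onto curly quotes and dashes. The names raw_bytes, decode_both and decoded are illustrative only.

```python
# Bytes that dominate the frequency distribution in the question.
raw_bytes = [b'\x92', b'\x93', b'\x94', b'\x97']


def decode_both(raw):
    """Return (iso-8859-1 code point, cp1252 code point) for a single byte."""
    return ord(raw.decode('iso-8859-1')), ord(raw.decode('cp1252'))


# Map each raw byte to the code points the two codecs produce.
decoded = dict((raw, decode_both(raw)) for raw in raw_bytes)

for raw in raw_bytes:
    latin1_cp, cp1252_cp = decoded[raw]
    # Latin-1 keeps the byte value (a C1 control code); cp1252 yields punctuation.
    print('%r -> latin-1 U+%04X, cp1252 U+%04X' % (raw, latin1_cp, cp1252_cp))
```

If the files really are cp1252, passing encoding='cp1252' to CategorizedPlaintextCorpusReader instead of 'iso-8859-1' should yield tokens such as u'\u2019' and u'\u201c', which are ordinary Unicode punctuation and easier to filter.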

score 1 · Answer 1 · answered Sep 27 '13 at 05:20

The easiest solution at the moment seems to be to add another set of characters (unicode_chars) to full_stop_set, so that they are filtered out before the frequency distribution is generated:

import codecs

punctuation = set(['.', '?', '!', ',', '$', ':', ';', '(',')','-',"`",'\'','"','>>','|','."',',"'])
# extra stop words, one per line, read from a UTF-8 text file
other_words = set([line.strip() for line in codecs.open('stopwords.txt', encoding='utf8')])
# curly quotes, em dash and the Unicode replacement character, alone or attached to '.' or ','
unicode_chars = set([u',\u201d',u'\u2019',u'\u2014',u'\u201c',u'.\u201d',u'\ufffd', u',\ufffd', u'.\ufffd'])
full_stop_set = set(nltk.corpus.stopwords.words('english')) | punctuation | other_words | unicode_chars

and then in the loop as before:

filtered_file_words = [w for w in rdr.words(fileids=[f]) if w.lower() not in full_stop_set]

It may not be the prettiest, but it keeps the strange characters from being considered in the frequency distribution.
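A possible alternative to listing every odd token by hand is to drop any token made up entirely of punctuation or symbol characters, using the Unicode category of each character. This is only a sketch, not the method used above: the helpers is_punct_or_symbol and drop_punctuation are hypothetical names, and it assumes the tokens are already unicode strings (as they are when the corpus reader is given an encoding).

```python
import unicodedata


def is_punct_or_symbol(token):
    """True if every character is in a Unicode punctuation (P*) or symbol (S*) category."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ('P', 'S') for ch in token
    )


def drop_punctuation(words, stop_set=frozenset()):
    """Keep words that are neither stop words nor pure punctuation/symbol tokens."""
    return [
        w for w in words
        if w.lower() not in stop_set and not is_punct_or_symbol(w)
    ]


# Sample tokens modelled on the ones reported in the question and answer.
sample = [u'company', u'\u2019', u',\u201d', u'.\ufffd', u'app', u'>>', u'the', u"don't"]
filtered = drop_punctuation(sample, stop_set=set([u'the']))
```

Words that merely contain punctuation, such as "don't", are kept because not all of their characters are punctuation. The replacement character u'\ufffd' is caught because it belongs to a symbol category.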
