How to get rid of the UnicodeDecodeError while reading raw data from PlaintextCorpusReader

Question

I am creating a Corpus from a set of text files in the following manner:

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

Now I wish to access the words of a file in the following manner:

text_bow = newcorpus.words("file_name.txt")

But I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

There are multiple files that throw this error. How can I get rid of this UnicodeDecodeError?

score 1 · Answer 1 · answered Dec 19 '17 at 03:21

First, find out which encoding your files use. Maybe try https://stackoverflow.com/a/16203777/610569 or ask the source of your data.
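If asking the source is not an option, one rough way to narrow it down is to try a few candidate encodings and see which ones decode the file without an error. The sketch below uses only the standard library; the helper name guess_encoding and the candidate list are my own assumptions for illustration, not part of NLTK. A successful decode does not prove the encoding is right (latin-1 accepts every possible byte), so treat the result as a hint and check the decoded text by eye.

```python
from pathlib import Path
import tempfile


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper: the candidate order matters, and latin-1 never
    fails, so it only makes sense as the last fallback.
    """
    data = Path(path).read_bytes()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return name
    return None


# Demo on a throwaway file that starts with byte 0x96, as in the question.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"\x96 a line that starts with byte 0x96")
    print(guess_encoding(sample))
```

Whatever name comes back can then be passed as the encoding= argument of PlaintextCorpusReader.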

Then use the encoding= argument in PlaintextCorpusReader, e.g. for latin-1:

newcorpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')

From the code https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py :

class PlaintextCorpusReader(CorpusReader):
    """
    Reader for corpora that consist of plaintext documents.  Paragraphs
    are assumed to be split using blank lines.  Sentences and words can
    be tokenized using the default tokenizers, or by custom tokenizers
    specificed as parameters to the constructor.
    This corpus reader can be customized (e.g., to skip preface
    sections of specific document formats) by creating a subclass and
    overriding the ``CorpusView`` class variable.
    """

    CorpusView = StreamBackedCorpusView
    """The corpus view class used by this reader.  Subclasses of
       ``PlaintextCorpusReader`` may specify alternative corpus view
       classes (e.g., to skip the preface sections of documents.)"""

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):

score 0 · Answer 2 · answered Dec 19 '17 at 03:05

To get rid of the decode error, do one of the following.

Read the corpus file as bytes and do not decode it to Unicode.
Find out which encoding the file actually uses and pass that. (The corpus documentation should tell you.) I suspect that it is Latin-1.
Use Latin-1 regardless of the actual encoding. This gets rid of the exception because Latin-1 maps every byte to a character, but the resulting string will be wrong wherever the real encoding differs.
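
To make the difference between the three options concrete, here is a small standard-library sketch on a made-up byte string containing 0x96, the byte from the error message. The sample bytes and the choice of cp1252 as the "real" encoding are assumptions for illustration only. In Windows-1252 (cp1252) byte 0x96 is an en dash, while in Latin-1 it is an invisible control character, so cp1252 may be worth trying if the files came from Windows software.

```python
raw = b"A \x96 B"  # made-up sample containing the byte from the error

# UTF-8 rejects the byte, which is what the question ran into.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)

# Option 1: keep the data as bytes. Nothing is decoded, so nothing can fail,
# but you have to work with bytes instead of str from here on.
as_bytes = raw

# Option 2: decode with the actual encoding (assumed to be cp1252 here).
as_cp1252 = raw.decode("cp1252")

# Option 3: decode as latin-1. This never raises, but 0x96 becomes the
# control character U+0096 rather than a visible dash.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))  # 'A – B'
print(repr(as_latin1))  # 'A \x96 B'
```

Both decodes succeed, which is why a missing exception alone does not tell you the encoding is right.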
