First, find out which encoding your files use. Try https://stackoverflow.com/a/16203777/610569, or ask the source of your data.
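If neither option is available, a rough trial-and-error check can narrow things down. The sketch below is only a heuristic and is not taken from the linked answer: the helper name guess_encoding and the candidate list are assumptions made for illustration. It returns the first candidate that decodes the file without error, which does not prove that the encoding is the right one. latin-1 accepts every byte sequence, so it has to come last.

```python
from pathlib import Path


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration only. A successful decode means
    the bytes are valid in that encoding, not that the text is correct.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
        return enc
    return None
```

Order the candidates from strictest to most permissive, and check a few decoded lines by eye before trusting the result.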
Then use the encoding= argument in PlaintextCorpusReader, e.g. for latin-1:
newcorpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')
From the code https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py :
class PlaintextCorpusReader(CorpusReader):
    """
    Reader for corpora that consist of plaintext documents. Paragraphs
    are assumed to be split using blank lines. Sentences and words can
    be tokenized using the default tokenizers, or by custom tokenizers
    specificed as parameters to the constructor.

    This corpus reader can be customized (e.g., to skip preface
    sections of specific document formats) by creating a subclass and
    overriding the ``CorpusView`` class variable.
    """

    CorpusView = StreamBackedCorpusView
    """The corpus view class used by this reader. Subclasses of
       ``PlaintextCorpusReader`` may specify alternative corpus view
       classes (e.g., to skip the preface sections of documents.)"""

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):