
I'm reading a bunch of RTF files into Python strings. On SOME texts, I get this error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte

I've tried:

  1. Copying and pasting the text of the files to new files
  2. Saving the RTF files as TXT files
  3. Opening the TXT files in Notepad++, choosing 'Convert to UTF-8', and also setting the encoding to UTF-8
  4. Opening the files with Microsoft Word and saving them as new files

Nothing works. Any ideas?

It's probably not related, but here's the code in case you're wondering:

# pyth parses the RTF files; scikit-learn builds the tf-idf matrix
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
from sklearn.feature_extraction.text import TfidfVectorizer

# runs once per RTF file; texts collects each document's plain text
f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)
Zach
  • try `X = vectorizer.fit_transform(texts.encode('utf-8'))`, if I remember correctly; I always mix up when to use .encode() and .decode(), so just try one and see what happens... – BrtH Aug 12 '12 at 00:16
  • doesn't work. I think it's a problem with the actual file. – Zach Aug 12 '12 at 00:46
  • try this, but I'm not sure it works: `string = ''.join([chr(ord(i)) for i in string])` – Squall Aug 12 '12 at 03:04

4 Answers


This will solve your issues:

import codecs

f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()

From that point on, txt is a unicode string and you can use it anywhere in your code.

If you want to generate UTF-8 files after your processing, do:

f.write(txt.encode('utf-8'))
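
If some bytes in the file are not valid UTF-8 at all (as the follow-up comment below shows), codecs.open also accepts an errors argument; a minimal sketch, reusing the dir+location path from the question:

import codecs

# 'replace' substitutes undecodable bytes with U+FFFD instead of raising
# UnicodeDecodeError; errors='ignore' would drop them silently
f = codecs.open(dir+location, 'r', encoding='utf-8', errors='replace')
txt = f.read()
f.close()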
Chelu Martín
  • the new open() returns: `Traceback (most recent call last): File "11.08.py", line 41, in t = f.read() File "C:\Python27\lib\codecs.py", line 671, in read return self.reader.read(size) File "C:\Python27\lib\codecs.py", line 477, in read newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1266: invalid start byte` – Zach Aug 12 '12 at 01:31

As I said on the mailing list, it is probably easiest to use the `charset_error` option and set it to `ignore`. If the file is actually UTF-16, you can also set the charset to utf-16 in the vectorizer. See the docs.
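
For example, a minimal sketch; charset and charset_error are the parameter names in the scikit-learn version shown in the traceback (newer releases call them encoding and decode_error):

from sklearn.feature_extraction.text import TfidfVectorizer

# option 1: skip bytes that cannot be decoded as UTF-8 instead of raising
vectorizer = TfidfVectorizer(charset_error='ignore', sublinear_tf=True,
                             max_df=0.5, stop_words='english')

# option 2: if the files are really UTF-16, declare that instead
# vectorizer = TfidfVectorizer(charset='utf-16', sublinear_tf=True,
#                              max_df=0.5, stop_words='english')

X = vectorizer.fit_transform(texts)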

Andreas Mueller
  • It's better to be aware of the charset of the document corpus and pass that explicitly to the `TfidfVectorizer` class so as to avoid silent decoding errors that might result in bad classification accuracy in the end. – ogrisel Aug 28 '12 at 10:15

You can dump the CSV file rows into a JSON file without any encoding error as follows:

json.dump(row,jsonfile, encoding="ISO-8859-1")
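
For context, a minimal Python 2 sketch (the csv loop and file names are assumptions for illustration, not part of the original answer); the encoding argument tells json.dump how to decode the byte strings in each row, and was removed in Python 3:

import csv
import json

# hypothetical input/output file names
with open('rows.csv', 'rb') as csvfile, open('rows.json', 'w') as jsonfile:
    for row in csv.reader(csvfile):
        # Python 2 only: interpret byte strings in row as ISO-8859-1
        json.dump(row, jsonfile, encoding="ISO-8859-1")
        jsonfile.write('\n')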
Piyush S. Wanare

Keep this line:

vectorizer = TfidfVectorizer(encoding='latin-1',sublinear_tf=True, max_df=0.5, stop_words='english')

encoding = 'latin-1' worked for me (latin-1 can decode any byte value, so the UnicodeDecodeError goes away).

Shalini Baranwal