
I want to calculate TF-IDF for a set of 10 documents. I am using Python (Anaconda) for this.

import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

But when I run tfs = tfidf.fit_transform(token_dict.values()) I get the following error message.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1412: invalid start byte

How do I fix this error?

dave

2 Answers


I was using the same reference for data preprocessing and got exactly the same error. These are the steps I took to get the code working perfectly with Python 2.7 on an Ubuntu 14.04 machine.

1) Use "codecs" to open the file and set the "encoding" parameter to ISO-8859-1. Here is how you do it:

import codecs
with codecs.open(pathToYourFileWithFileName, "r", encoding="ISO-8859-1") as file_handle:
    text = file_handle.read()
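
Applied to the loop in the question, that looks roughly like this (a minimal sketch, reusing path, file_path and the rest of the question's code):

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        # codecs.open returns unicode text decoded as ISO-8859-1
        with codecs.open(file_path, "r", encoding="ISO-8859-1") as shakes:
            text = shakes.read()
        lowers = text.lower()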

2) Once you have done this first step, you will bump into a second problem when using

no_punctuation = lowers.translate(None, string.punctuation)

which is explained here: string.translate() with unicode data in python

The solution goes like this:

lowers = text.lower()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
no_punctuation = lowers.translate(remove_punctuation_map)
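
Putting both steps together, the body of the loop from the question would read roughly as follows (a sketch, not a tested drop-in replacement; the TfidfVectorizer part afterwards stays unchanged):

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        # read the file as unicode with an explicit encoding
        with codecs.open(file_path, "r", encoding="ISO-8859-1") as shakes:
            text = shakes.read()
        lowers = text.lower()
        # unicode.translate() expects a mapping of code points, not two strings
        remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
        no_punctuation = lowers.translate(remove_punctuation_map)
        token_dict[file] = no_punctuation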

I hope this helps.

vvs14

Your data is encoded with a different encoding :)

To decode the data in a string, use the following:

myvar.decode("ENCODING")

where ENCODING can be any encoding name. The function you are calling is decoding your data as "utf-8" in the background, which is what fails here.

You should try "latin1" or "latin2"; along with utf-8, these are the most commonly used encodings.
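
For example, applied to the reading loop in the question, the decode can go right after the read (a sketch; you would still need the unicode-friendly punctuation removal from the other answer afterwards):

shakes = open(file_path, 'rb')
text = shakes.read().decode("latin1")  # or try "latin2" if the result looks wrong
lowers = text.lower()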

Cheers

Isaac