
I want to calculate TF-IDF for a set of 10 documents. I am using Python (Anaconda) for this.

import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

But when I run tfs = tfidf.fit_transform(token_dict.values()) I get the following error message.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1412: invalid start byte

How do I fix this error?

dave

2 Answers


I was using the same reference for data preprocessing and got exactly the same error. These are the steps I took to get the code working perfectly with Python 2.7 on an Ubuntu 14.04 machine.

1) Use "codecs" to open the file and set the "encoding" parameter to ISO-8859-1. Here is how you do it:

import codecs
with codecs.open(pathToYourFileWithFileName, "r", encoding="ISO-8859-1") as file_handle:
    text = file_handle.read()
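
Applied to the loop in the question, that looks roughly like this (a minimal sketch, reusing path, file_path and the rest of the question's code):

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        # codecs.open returns unicode text decoded as ISO-8859-1
        with codecs.open(file_path, "r", encoding="ISO-8859-1") as shakes:
            text = shakes.read()
        lowers = text.lower()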

2) Once you have done this first step, you will bump into a second problem when using

no_punctuation = lowers.translate(None, string.punctuation)

which is explained here: string.translate() with unicode data in python

The solution goes like this:

lowers = text.lower()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
no_punctuation = lowers.translate(remove_punctuation_map)
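
Putting both steps together, the body of the loop from the question would read roughly as follows (a sketch, not a tested drop-in replacement; the TfidfVectorizer part afterwards stays unchanged):

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        # read the file as unicode with an explicit encoding
        with codecs.open(file_path, "r", encoding="ISO-8859-1") as shakes:
            text = shakes.read()
        lowers = text.lower()
        # unicode.translate() expects a mapping of code points, not two strings
        remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
        no_punctuation = lowers.translate(remove_punctuation_map)
        token_dict[file] = no_punctuation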

I hope this helps.

vvs14

Your data is encoded with a different encoding :)

To decode the data in a string, use the following:

myvar.decode("ENCODING")

where ENCODING can be any encoding name. The function you are calling is decoding your data as "utf-8" in the background, which is what fails here.

You should try "latin1" or "latin2"; along with utf-8, these are the most commonly used encodings.
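
For example, applied to the reading loop in the question, the decode can go right after the read (a sketch; you would still need the unicode-friendly punctuation removal from the other answer afterwards):

shakes = open(file_path, 'rb')
text = shakes.read().decode("latin1")  # or try "latin2" if the result looks wrong
lowers = text.lower()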

Cheers

Isaac