
I'm writing code to stem tweets, but I'm having issues with encoding. When I try to apply the Porter stemmer it throws an error. Maybe I'm not tokenizing the text properly.

My code is as follows...

import sys
import pandas as pd
import nltk
import scipy as sp
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
reload(sys)  
sys.setdefaultencoding('utf8')


stemmer = nltk.stem.PorterStemmer()

p_test = pd.read_csv('TestSA.csv')
train = pd.read_csv('TrainSA.csv')

def word_feats(words):
    return dict([(word, True) for word in words])

for i in range(len(train)-1):
    t = []
    #train.SentimentText[i] = " ".join(t)
    for word in nltk.word_tokenize(train.SentimentText[i]):
        t.append(stemmer.stem(word))
    train.SentimentText[i] = ' '.join(t)

When I try to execute it, it returns this error:


UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-5aa856d0307f> in <module>()
     23     #train.SentimentText[i] = " ".join(t)
     24     for word in nltk.word_tokenize(train.SentimentText[i]):
---> 25         t.append(stemmer.stem(word))
     26     train.SentimentText[i] = ' '.join(t)
     27 

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
    631     def stem(self, word):
    632         stem = self.stem_word(word.lower(), 0, len(word) - 1)
--> 633         return self._adjust_case(word, stem)
    634 
    635     ## --NLTK--

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in _adjust_case(self, word, stem)
    602         for x in range(len(stem)):
    603             if lower[x] == stem[x]:
--> 604                 ret += word[x]
    605             else:
    606                 ret += stem[x]

/usr/lib64/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Does anybody have a clue what is wrong with my code? I'm stuck on this error. Any suggestions?

Vishal Kharde
  • Use `python3`, see https://www.youtube.com/watch?v=sgHbC6udIqc – alvas Jan 14 '16 at 11:05
  • Possible duplicate of [UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c](http://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c) – alvas Jan 14 '16 at 11:06
  • Related: http://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte and http://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c – alvas Jan 14 '16 at 11:07
  • To help you precisely, could you post the data set or a sample online. Without seeing the data, it's a shot in the dark to know what went wrong. – alvas Jan 14 '16 at 11:14
  • @alvas SentimentText contains the text of a tweet from Twitter. – Vishal Kharde Jan 14 '16 at 11:20
  • Knowing the genre of text doesn't help unless you can see and know what is the exact encoding of the data. – alvas Jan 14 '16 at 11:37
  • It seems very very likely that you're dealing with latin-1 encoding outside of ascii and by correctly encoding/decoding it as utf8, you might resolve the problem but still without seeing the data, it's just random guess as to what is happening =( – alvas Jan 14 '16 at 11:38
  • Also, do note that the "sin" of `sys.setdefaultencoding()`: http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script – alvas Jan 14 '16 at 11:42

1 Answer


I think the key line is 604, one frame above the one that raises the error:

--> 604                 ret += word[x]

Probably ret is a Unicode string and word is a byte string, and you cannot decode UTF-8 byte by byte, which is what that loop ends up doing.
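
You can reproduce the failure in isolation (Python 2, with the default encoding forced to utf-8 as your script does): 0xc3 is only the first byte of a two-byte UTF-8 sequence, so a single byte can never be decoded on its own:

>>> u'' + '\xc3'   # implicit decode of one lone byte from a multi-byte character
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data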

The problem is that read_csv is returning bytes, and you are trying to do text processing on those bytes. That simply doesn't work; those bytes have to be decoded to Unicode first. I think you can use:

pandas.read_csv(filename, encoding='utf-8')
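
Applied to your loop, that would look something like this (just a sketch, and I'm assuming the files really are UTF-8 encoded; if they turn out to be latin-1, as suggested in the comments, pass encoding='latin-1' instead):

import pandas as pd
import nltk

stemmer = nltk.stem.PorterStemmer()

# Decode once, at read time, so every SentimentText value is a unicode
# string rather than raw bytes.
train = pd.read_csv('TrainSA.csv', encoding='utf-8')

for i in range(len(train)):   # note: range(len(train)-1) would skip the last row
    # The tokenizer and stemmer now receive unicode, so the stemmer
    # never has to decode anything itself.
    t = [stemmer.stem(word) for word in nltk.word_tokenize(train.SentimentText[i])]
    train.SentimentText[i] = ' '.join(t)

With the data decoded up front you can also drop the reload(sys) / sys.setdefaultencoding('utf8') hack entirely.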

If possible, use Python 3. Then trying to concatenate bytes and unicode will always raise an error, making it much easier to spot these problems.
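
For example, in Python 3 mixing the two types fails immediately (the exact wording varies between versions):

>>> 'stem' + b'\xc3'
TypeError: can only concatenate str (not "bytes") to str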

roeland