UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

Question

Help me figure out what's wrong with my Python code.

Here is the code:

import nltk
import re
import pickle


raw = open('tom_sawyer_shrt.txt').read()

### this is how the basic Punkt sentence tokenizer works
#sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
#sents = sent_tokenizer.tokenize(raw)

### train & tokenize text using text
sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(sent_trainer)
# break in to sentences
sents = sent_tokenizer.tokenize(raw)
# get sentence start/stop indexes
sentspan = sent_tokenizer.span_tokenize(raw)



###  Remove \n in the middle of sentences, due to fixed-width formatting
for i in range(0,len(sents)-1):
    sents[i] = re.sub('(?<!\n)\n(?!\n)',' ',raw[sentspan[i][0]:sentspan[i+1][0]])

for i in range(1,len(sents)):
    if (sents[i][0:3] == '"\n\n'):
        sents[i-1] = sents[i-1]+'"\n\n'
        sents[i] = sents[i][3:]


### Loop thru each sentence, fix to 140char
i=0
tweet=[]
while (i<len(sents)):
    if (len(sents[i]) > 140):
        ntwt = int(len(sents[i])/140) + 1
        words = sents[i].split(' ')
        nwords = len(words)
        for k in range(0,ntwt):
            tweet = tweet + [
                re.sub('\A\s|\s\Z', '', ' '.join(
                words[int(k*nwords/float(ntwt)):
                      int((k+1)*nwords/float(ntwt))]
                ))]
        i=i+1
    else:
        if (i<len(sents)-1):
            if (len(sents[i])+len(sents[i+1]) <140):
                nextra = 1
                while (len(''.join(sents[i:i+nextra+1]))<140):
                    nextra=nextra+1
                tweet = tweet+[
                    re.sub('\A\s|\s\Z', '',''.join(sents[i:i+nextra]))
                    ]        
                i = i+nextra
            else:
                tweet = tweet+[re.sub('\A\s|\s\Z', '',sents[i])]
                i=i+1
        else:
            tweet = tweet+[re.sub('\A\s|\s\Z', '',sents[i])]
            i=i+1


### A last pass to clean up leading/trailing newlines/spaces.
for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A\s|\s\Z','',tweet[i])

for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A"\n\n','',tweet[i])


###  Save tweets to pickle file for easy reading later
output = open('tweet_list.pkl','wb')
pickle.dump(tweet,output,-1)
output.close()


listout = open('tweet_lis.txt','w')
for i in range(0,len(tweet)):
    listout.write(tweet[i])
    listout.write('\n-----------------\n')

listout.close()

And here is the error message:

Traceback (most recent call last):
  File "twain_prep.py", line 13, in <module>
    sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1227, in train
    token_cls=self._Token).get_params()
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 649, in __init__
    self.train(train_text, verbose, finalize=True)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 713, in train
    self._train_tokens(self._tokenize_words(text), verbose)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 729, in _train_tokens
    tokens = list(tokens)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

Rahul · Answer 1 · 2017-11-01T06:52:54.900

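A UnicodeDecodeError is raised when Python has to decode bytes with a codec that cannot represent them. In Python 2, a plain string (str) is a sequence of bytes, and whenever such a string is implicitly converted to Unicode, Python uses the ascii codec, which only accepts byte values 0 to 127. Your file contains at least one byte outside that range (0xe2, which is often the first byte of a UTF-8 encoded character such as a curly quote or a dash), so the tokenizer fails as soon as it has to treat that text as Unicode.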
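To make the cause concrete, here is a minimal sketch that reproduces the same kind of failure on a short byte string. The sample text is made up for illustration, and it assumes the offending bytes are UTF-8 encoded curly quotes, which is a common source of 0xe2 but not the only one.

```python
# Hypothetical sample: the bytes \xe2\x80\x9c and \xe2\x80\x9d are the
# UTF-8 encodings of the curly quotes U+201C and U+201D.
sample = b'Tom \xe2\x80\x9cSawyer\xe2\x80\x9d'

# Decoding with the ascii codec fails on the first byte above 127.
try:
    sample.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
text = sample.decode('utf-8')

print(error_message)
print(len(sample), len(text))
```

The byte string is longer than the decoded text because each curly quote takes three bytes in UTF-8 but is a single character once decoded.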

So how to fix it?

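You can convert your text to ASCII and drop the non-ASCII characters. Because raw is a byte string in Python 2, decode it first (this assumes the file is UTF-8) and then encode it to ASCII with errors ignored:

raw = raw.decode('utf-8').encode('ascii', 'ignore')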
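An alternative that avoids throwing characters away is to decode the file when you read it. The sketch below is only an illustration: the helper names read_text and to_ascii are made up, and it assumes the file is UTF-8 encoded. io.open accepts an encoding argument on both Python 2 and Python 3.

```python
import io


def read_text(path, encoding='utf-8'):
    """Read a file and return decoded text (unicode on Python 2, str on Python 3).

    The default encoding is an assumption; change it if the file was
    saved in a different encoding.
    """
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


def to_ascii(text):
    """Drop every non-ASCII character from decoded text. This is lossy."""
    return text.encode('ascii', 'ignore').decode('ascii')
```

With these helpers, raw = read_text('tom_sawyer_shrt.txt') gives you decoded text, and to_ascii(raw) is only needed if a later step really requires plain ASCII. Keep in mind that dropping characters also removes the curly quotes, so checks in the script that look for a plain '"' will not see them.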

You can also read this post on handling Unicode errors:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
