0

UnicodeDecodeError

def getWordFreqs(textPath, stopWordsPath):
    wordFreqs = dict()
    #open the file in read mode and open stop words
    file = open(textPath, 'r')
    stopWords = set(line.strip() for line in open(stopWordsPath))
    #read the text
    text = file.read()
    #exclude punctuation and convert to lower case; exclude numbers as well
    punctuation = set('!"#$%&\()*+,-./:;<=>?@[\\]^_`{|}~')
    text = ''.join(ch.lower() for ch in text if ch not in punctuation)
    text = ''.join(ch for ch in text if not ch.isdigit())
    #read through the words and add to frequency dictionary
    #if it is not a stop word
    for word in text.split():
        if word not in stopWords:
            if word in wordFreqs:
                wordFreqs[word] += 1
            else:
                wordFreqs[word] = 1

I get the following error every time I try to run this function in Python 3.5.2, but it works fine in 3.4.3. I cannot figure out what is causing it.

line 9, in getWordFreqs
    text = file.read()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 520: ordinal not in range(128)
furas
John
  • Use the `{}` button to display code correctly. – furas Oct 29 '16 at 23:37
  • Please format your code correctly by copy-and-paste from the original source code, then highlighting the code and clicking the `{}` button in the editor. – Rory Daulton Oct 29 '16 at 23:37
  • Python probably tries to decode the file to Unicode when it reads it, but it doesn't know which encoding the file uses, so it treats it as ASCII. Maybe try `encoding=` in `open()`: https://docs.python.org/3/library/functions.html#open – furas Oct 29 '16 at 23:43
  • I tried this, but now I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 520: invalid start byte – John Oct 30 '16 at 00:12

2 Answers

1

In Python 3, `open` defaults to the encoding returned by `locale.getpreferredencoding(False)`. That is not usually ASCII, but it can be in some environments (for example, when no locale is configured), which is what your error message indicates.
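To see which default applies on a given machine, a quick check along these lines may help. This is only a sketch; the helper name `default_text_encoding` is made up for illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() falls back to when encoding= is omitted.

    The function name is hypothetical; only locale.getpreferredencoding
    comes from the standard library.
    """
    return locale.getpreferredencoding(False)


# Prints the name of the default encoding for this environment.
print(default_text_encoding())
```

If this prints an ASCII variant, any non-ASCII byte in the file will trigger the error from the question unless `encoding=` is passed explicitly.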

Instead, specify the encoding of the file you are trying to read. If the file was created under Windows, the encoding is likely cp1252, especially since the byte `\x97` is an EM DASH in that encoding.

Try:

file = open(textPath, 'r', encoding='cp1252')
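As a minimal, self-contained sketch of the same idea, the snippet below writes a small sample file containing the byte 0x97 and reads it back as cp1252. The sample bytes and the temporary file are assumptions made for illustration; they are not the asker's data.

```python
import os
import tempfile

# Create a throwaway file containing the problematic byte 0x97
# (hypothetical sample data, not the file from the question).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"word \x97 another word")
    sample_path = tmp.name

try:
    # An explicit encoding means the platform default no longer matters;
    # the with-block also closes the file, which the original code never did.
    with open(sample_path, "r", encoding="cp1252") as f:
        text = f.read()
finally:
    os.remove(sample_path)

print(repr(text))
```

Note that the decoded em dash (U+2014) is not in the punctuation set used in the question, so it would survive as its own token after `text.split()` unless it is added to that set.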
Mark Tolonen
-2

I believe one way to solve your problem is to put this code at the top of your file.

import sys
reload(sys)
sys.setdefaultencoding("UTF8")

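This sets the default encoding to UTF-8. It only works in Python 2, though: in Python 3, `reload` is not a builtin and `sys.setdefaultencoding` does not exist.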

Another (better) solution is the standard-library `codecs` module, which is very easy to use.

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )

The `fileObj` then behaves like a normal file object. Pass the file's actual encoding as the third argument.
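A short sketch of how this might look for a file containing the 0x97 byte from the question follows. The sample bytes and temporary file are assumptions; the `errors="replace"` variant is shown only to illustrate what happens when the guessed encoding is wrong.

```python
import codecs
import os
import tempfile

# Throwaway file with the byte 0x97 (hypothetical sample data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\x97bar")
    path = tmp.name

try:
    # Decoding with the encoding the file was actually written in.
    with codecs.open(path, "r", "cp1252") as f:
        as_cp1252 = f.read()

    # Decoding as UTF-8 would raise UnicodeDecodeError on 0x97;
    # errors="replace" substitutes U+FFFD instead of raising.
    with codecs.open(path, "r", "utf-8", errors="replace") as f:
        as_utf8_replaced = f.read()
finally:
    os.remove(path)

print(repr(as_cp1252))
print(repr(as_utf8_replaced))
```

In Python 3 the built-in `open` accepts the same `encoding` and `errors` arguments, so `codecs.open` is mainly useful for code that also has to run on Python 2.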


Note for Method 1
This can be extremely dangerous when used with third-party code that assumes ASCII as the default encoding. Use with caution.

Superman
  • [Why sys.setdefaultencoding will break code](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/). Definitely not the "best way". The function doesn't work without the `reload(sys)` trick for a reason. – Mark Tolonen Oct 30 '16 at 01:39
  • @MarkTolonen I clarified what I meant to say – Superman Oct 30 '16 at 02:42