UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc : invalid start byte

Question

I'm getting the following error when executing my script which analyses text from a csv file.

The sentence contains German characters such as é and ü. It looks like Python is falling over on these characters. I've tried changing from ascii to utf-8 encoding but this hasn't really helped.

My Python script:

    import csv
    from textblob import TextBlob

    infile = 'C:\Python27\file.csv'

    with open(infile, 'rb') as csvfile:
        rows = csv.reader(csvfile)
        for row in rows:
            sentence = row[4]
            blob = TextBlob(sentence)
            print sentence
            print blob.sentiment.polarity, blob.sentiment.subjectivity

(Also, if someone can explain how to output the results to a CSV file, that would be very much appreciated.)
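
On the CSV output part of the question, here is a minimal sketch for Python 3, not a definitive implementation. The names `score` and `write_scores`, the output column headers, and the `latin-1` default are assumptions made for illustration. `score` is a hypothetical stand-in so the sketch runs without textblob installed; in the real script it would presumably return the two `TextBlob(sentence).sentiment` values.

```python
import csv


def score(sentence):
    """Hypothetical stand-in for the TextBlob sentiment lookup."""
    # Placeholder values only. With textblob installed this could be:
    #   blob = TextBlob(sentence)
    #   return blob.sentiment.polarity, blob.sentiment.subjectivity
    return 0.0, 0.0


def write_scores(in_path, out_path, column=4, encoding="latin-1"):
    """Write one text column and its two scores to a new CSV file."""
    # newline="" lets the csv module handle line endings itself.
    # The input encoding is an assumption about the file; the output is UTF-8.
    with open(in_path, newline="", encoding=encoding) as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["sentence", "polarity", "subjectivity"])
        for row in csv.reader(src):
            sentence = row[column]
            polarity, subjectivity = score(sentence)
            writer.writerow([sentence, polarity, subjectivity])
```

It could be called as `write_scores(r'C:\Python27\file.csv', r'C:\Python27\scores.csv')`, where the output path is made up for the example. The raw-string prefix matters here because `\f` in an ordinary string literal is a form feed character.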

Show the problematic data. See the [Stack Overflow `character-encoding` tag info page](http://stackoverflow.com/tags/character-encoding/info) for troubleshooting information, and edit your question into a minimal, complete, and verifiable example. — tripleee, Jul 31 '18 at 16:14
But from the circumstantial evidence I guess the file is encoded with `latin-1`, not `utf-8`. — tripleee, Jul 31 '18 at 16:15
It looks like it's failing to decode the sentence "Ooops. Sorry... @o2de Bitte übernehmen Danke!" The error message is: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 28: invalid start byte — dragonfury2, Jul 31 '18 at 16:34
Thank you, changing the encoding to 'Latin-1' helped. My only question is: will this have implications when decoding any other 'English' chars? — dragonfury2, Jul 31 '18 at 16:38
No, as long as the entire file uses the same encoding, you should be fine. Please accept the duplicate nomination once I find a suitable post with a proper answer. — tripleee, Jul 31 '18 at 17:03
As an aside, you should definitely be switching to Python 3 very soon. By the original timetable, version 2 was supposed to be end-of-lifed earlier this year (though it got an extension, and is now in prolonged terminal care). Py3 brings some changes in this area, generally for the better; though you'll also want to switch away from legacy 8-bit encodings in your data files to reap the full benefits. — tripleee, Jul 31 '18 at 17:13
Also, on Windows you should probably use `r'raw strings'` for file paths with backslashes in them. — tripleee, Jul 31 '18 at 17:14
Unicode handling is much nicer on Python 3. In the meantime you may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. Also check out the [Unicode HOWTO](https://docs.python.org/2/howto/unicode.html) in the Python docs. And I guess I ought to mention [Joel's classic article](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). — PM 2Ring, Jul 31 '18 at 19:04
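
The diagnosis in the comments can be reproduced with a short sketch in Python 3. It assumes the sentence quoted above was stored as Latin-1 bytes, which is the commenter's guess and not something the question confirms. Byte 0xFC cannot start a character in UTF-8, whereas Latin-1 maps it directly to U+00FC (ü).

```python
# The sentence from the comments, as the bytes a Latin-1 file would hold
# (assumed encoding; the original file is not shown in the question).
raw = b"Ooops. Sorry... @o2de Bitte \xfcbernehmen Danke!"

try:
    raw.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the byte the codec rejected.
    failed_at = exc.start

# Latin-1 maps every byte value to the code point with the same number,
# so this decode cannot raise, even when the guess is wrong.
text = raw.decode("latin-1")

print(failed_at, ascii(text[failed_at]))
```

Because a Latin-1 decode never fails, success alone does not prove the file really is Latin-1. Checking that the decoded text reads correctly, as the asker did, is still needed.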

0 Answers