UnicodeDecodeError on python3

Question

I'm currently trying to run some simple regexes on a very big .txt file (a couple of million lines of text). The simplest code that reproduces the problem:

file = open("exampleFileName", "r")
for line in file:
    pass

The error message:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    example()
  File "example.py", line 16, in example
    for line in file:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte

How can I fix this? Is UTF-8 the wrong encoding? And if it is, how do I know which one is right?

Thanks and best regards!

Possibly related to http://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte — Jeff, Aug 17 '16 at 16:26
Post the output of `file -bi [your_filename]`. You'll get an encoding. After that provide the `encoding` argument to `open()`. — , Aug 17 '16 at 16:27

score 12 · Accepted Answer · answered Aug 17 '16 at 16:25

The file does not appear to be valid UTF-8, so try reading it with the latin-1 encoding:

file = open('exampleFileName', 'r', encoding='latin-1')
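As a minimal sketch of the line above in context: the helper name `count_lines` and the temporary sample file are hypothetical, made up for illustration. It assumes the data really is latin-1, which the error message alone cannot confirm, because latin-1 maps every possible byte to a character and therefore never raises a decode error.

```python
import os
import tempfile


def count_lines(path, encoding="latin-1"):
    """Iterate over a text file with an explicit encoding and count its lines."""
    count = 0
    with open(path, "r", encoding=encoding) as handle:
        for _line in handle:
            count += 1
    return count


# Build a small sample file (hypothetical data) whose bytes are not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xe9\nr\xedo\n")
    sample_path = tmp.name

try:
    count_lines(sample_path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# latin-1 accepts every byte value, so this read cannot fail on decoding.
latin1_count = count_lines(sample_path)
os.remove(sample_path)
```

If the decoded text shows odd characters, the file is probably in some other single-byte encoding, and the `encoding` argument is the only thing that needs to change.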

mic4ael


Do you know how to do the same when reading from the command line? I use the `input()` function. Is there a way to configure its encoding, or is there some other configurable function? – chivorotkiv Nov 11 '17 at 14:26
How did you figure out to use latin-1 encoding? – Reihan_amn Mar 01 '18 at 23:19
0xed is the `í` character, which you can find in the latin-1 encoding – mic4ael Mar 02 '18 at 06:06
So confused! Now that Unicode has come onto the scene to cover all ~1.1 M code points, why is latin-1 still around? Shouldn't latin-1 be a subset of UTF? Shouldn't all the characters defined in latin-1 now be part of UTF? If so, why can't UTF support it? (Sorry, I'm kind of new to this field.) – Reihan_amn Mar 08 '18 at 20:12
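On the last question: Unicode does contain every latin-1 character at the same code point (0 to 255), but UTF-8 is a different byte-level encoding of those code points. The short sketch below, with variable names chosen only for illustration, shows that `í` is one byte in latin-1 and two bytes in UTF-8, and that a bare 0xed byte is not valid UTF-8 on its own.

```python
char = "\u00ed"  # the character í

latin1_bytes = char.encode("latin-1")  # a single byte: 0xed
utf8_bytes = char.encode("utf-8")      # two bytes: 0xc3 0xad

# The code point is the same in Unicode and latin-1; only the byte layout differs.
code_point = ord(char)

# In UTF-8, 0xed announces a three-byte sequence, so it must be followed by
# continuation bytes. A latin-1 file has ordinary text after it instead.
try:
    b"\xed abc".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding the same bytes as latin-1 works, because every byte is a character.
latin1_text = b"\xed abc".decode("latin-1")
```

So a file saved as latin-1 has to be read as latin-1 (or converted once to UTF-8); the UTF-8 decoder cannot interpret its bytes directly.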

score 0 · Answer 2 · edited May 23 '17 at 12:33

It is not possible to reliably identify the encoding on the fly. So either use the method I described in my comment, or use a construction like the one below (as proposed by another answer), though this is a shot in the dark:

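try:
    # Decoding happens when the data is read, not in open(), so read inside the try.
    with open("exampleFileName", "r", encoding="utf-8") as file:
        text = file.read()
except UnicodeDecodeError:
    try:
        with open("exampleFileName", "r", encoding="latin2") as file:
            text = file.read()
    except UnicodeDecodeError:
        pass  # ...try the next encoding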

And so on, until you have tested all the encodings listed under Standard Encodings in the Python documentation.

So I think there's no need to bother with this nested hell: just run `file -bi [filename]` once, copy the encoding, and forget about it.
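If the fallback approach is still wanted, the nesting can be flattened into a loop. This is only a sketch: `read_with_fallback` and its default list of candidates are hypothetical choices, not a recommendation from the answers above. Note that a successful decode does not prove the encoding is right, and that single-byte encodings such as latin-1 accept any byte, so they belong at the end of the list.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed; move on to the next one
    raise ValueError("none of the candidate encodings could decode " + path)


# Hypothetical sample data: the byte 0xed makes the file invalid as UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"r\xedo\n")
    sample_path = tmp.name

text, used_encoding = read_with_fallback(sample_path)
os.remove(sample_path)
```

The helper reads the whole file into memory, which keeps the sketch short; for a file with millions of lines it may be preferable to detect the encoding once and then iterate line by line.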

UPD: Actually, I've found another Stack Overflow answer that you can use if you're on Windows.
