6

I'm currently trying to run some simple regex on a very big .txt file (a couple of million lines of text). The simplest code that reproduces the problem:

file = open("exampleFileName", "r")
for line in file:
    pass

The error message:

Traceback (most recent call last):
  File "example.py", line 34, in <module>
    example()
  File "example.py", line 16, in example
    for line in file:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte

How can I fix this? Is utf-8 the wrong encoding? And if it is, how do I know which one is right?

Thanks and best regards!

EliteKaffee
  • Possibly related to http://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte – Jeff Aug 17 '16 at 16:26
  • Post the output of `file -bi [your_filename]`. You'll get an encoding. After that provide the `encoding` argument to `open()`. –  Aug 17 '16 at 16:27
  • What does the `file -bi` command do? – Reihan_amn Mar 01 '18 at 23:15

2 Answers

12

It looks like the file is not valid UTF-8, so you should try reading it with the latin-1 encoding. Try:

file = open('exampleFileName', 'r', encoding='latin-1') 
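
A minimal sketch of what this change does, assuming a small hypothetical sample file rather than the real data. latin-1 maps every possible byte to a character, so it never raises, but that also means it cannot tell you whether latin-1 is actually the right encoding. Passing `errors="replace"` to `open()` is an alternative that keeps UTF-8 and marks the undecodable bytes instead.

```python
import os
import tempfile

# Hypothetical sample data: "café con leche" encoded as latin-1,
# which is not valid UTF-8 because of the lone 0xE9 byte.
raw = "caf\u00e9 con leche\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

# Strict UTF-8 decoding fails, as in the question.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 never fails: each byte becomes the code point with the same value.
with open(path, "r", encoding="latin-1") as f:
    latin1_text = f.read()

# Alternative: keep UTF-8 but replace undecodable bytes with U+FFFD.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    replaced_text = f.read()

os.remove(path)
```

If the decoded text looks wrong (for example, odd symbols where accented letters should be), the file is probably in some other single-byte encoding.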
mic4ael
  • Do you know how to do the same when reading from command line? I use `input()` function, is there a way to configure its encoding or is there some other configurable function? – chivorotkiv Nov 11 '17 at 14:26
  • How did you figure out to use latin-1 encoding? – Reihan_amn Mar 01 '18 at 23:19
  • 0xed is the `í` character in the latin-1 encoding – mic4ael Mar 02 '18 at 06:06
  • So confused! Now that Unicode has come onto the scene to cover all ~2 m code points, why is latin-1 encoding still around? Shouldn't latin-1 be a subset of UTF encoding? Shouldn't all the codes defined in latin-1 now be part of UTF? If so, why can't UTF support it? (Sorry, I am kind of new to this field.) – Reihan_amn Mar 08 '18 at 20:12
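
On the last comment: the latin-1 characters are all part of Unicode, with the same code point numbers, but UTF-8 stores code points above 127 as multi-byte sequences. So the same character is written with different bytes in the two encodings, and a latin-1 file containing `í` is not valid UTF-8. A small sketch to illustrate (the variable names are just for this example):

```python
ch = "\u00ed"  # the character í, code point U+00ED

latin1_bytes = ch.encode("latin-1")  # a single byte with value 0xED
utf8_bytes = ch.encode("utf-8")      # two bytes: 0xC3 0xAD

# In UTF-8, 0xED announces a three-byte sequence, so it must be followed
# by continuation bytes. A lone 0xED followed by ASCII is rejected.
try:
    b"\xed ".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError as exc:
    lone_byte_is_valid_utf8 = False
    reason = exc.reason
```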
0

It is not possible to reliably identify the encoding on the fly. So, either use the method I described in a comment or use a construction like the one below (similar to what the other answer proposes), but this is a wild shot:

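try:
    # open() itself does not decode; the error only appears on read
    with open("exampleFileName", "r", encoding="utf-8") as file:
        text = file.read()
except UnicodeDecodeError:
    try:
        with open("exampleFileName", "r", encoding="latin2") as file:
            text = file.read()
    except UnicodeDecodeError:
        pass  # ...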

And so on, until you test all the encodings from Standard Python Encodings.
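
The same idea can be written as a loop instead of nested `try` blocks. This is only a sketch: `read_with_fallback` and the candidate list are hypothetical, and a decode that succeeds does not prove the encoding is correct. Single-byte encodings such as latin-1 accept any input, so they belong at the end of the list. The read happens inside the `try` because decoding errors surface when the data is read, not when the file is opened.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo with a hypothetical file containing one non-UTF-8 byte (0xED).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\xedb\n")
    demo_path = tmp.name

text, used = read_with_fallback(demo_path)
os.remove(demo_path)
```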

So I think there's no need to bother with this nested hell; just run `file -bi [filename]` once, copy the encoding, and forget about this.
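
If `file` is not available, looking at the bytes around the failing position can also give a hint about the encoding. A rough sketch, assuming the data fits in memory; `first_bad_byte` is a hypothetical helper, and for a real file the bytes would come from `open(path, "rb").read()`:

```python
def first_bad_byte(data, encoding="utf-8", context=10):
    """Return (position, byte value, nearby bytes) for the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return exc.start, data[exc.start], data[lo:exc.start + context]
    return None


# Hypothetical sample: "Buenos días" stored as latin-1 bytes.
sample = b"Buenos d\xedas\n"
info = first_bad_byte(sample)
```

Here the surrounding text makes it plausible that 0xED stands for `í`, which points to latin-1 or a close relative such as cp1252.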

UPD. Actually, I've found another stackoverflow answer which you can use if you're on Windows.

Community