Python3 equivalent to Python2 open when encountering UnicodeDecodeErrors

Question

I had a script I was trying to port from python2 to python3.

I did read through the porting documentation, https://docs.python.org/3/howto/pyporting.html.

My original python2 script used open('filename.txt'). In porting to python3 I updated it to io.open('filename.txt'). Now when running the script as python2 or python3 with the same input files, I get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 793: invalid start byte.

Does python2 open have less strict error checking than io.open, or does it use a different default encoding? Does python3 have an equivalent way to call io.open that matches the python2 built-in open?

Currently I've started using f = io.open('filename.txt', mode='r', errors='replace'), which works. Comparing the output to the original python2 version, no important data was lost to the replaced characters.

Just search online for the error message to get an idea what may cause it. Make sure you also extract a [mcve] from your code so you know the actual code causing it. — Ulrich Eckhardt, Feb 04 '22 at 15:14

chepner · Accepted Answer · 2022-02-04T15:18:02.597

First, in Python 3 io.open is just an alias for the built-in open; there's no need to stop using open directly.

The issue is that your call to open is assuming the file is UTF-8-encoded when it is not. You'll have to supply the correct encoding explicitly.

open('filename.txt', encoding='iso-8859-1')  # for example

(Note that the default encoding is platform-specific, but your error indicates that you are, in fact, defaulting to UTF-8.)
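To make that concrete, here is a minimal sketch that reproduces the error and then avoids it by naming an encoding. The sample bytes and the choice of cp1252 are assumptions for illustration (0x93 is a left curly quote in Windows-1252); the real encoding of your files may differ.

```python
import os
import tempfile

# Hypothetical sample: curly quotes as Windows-1252 stores them (0x93, 0x94).
raw = b"He said \x93hello\x94\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding as UTF-8 fails, because 0x93 cannot start a UTF-8 sequence.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Naming the (assumed) real encoding lets every byte decode.
    with open(path, encoding="cp1252") as f:
        text = f.read()

print(utf8_failed)
print(ascii(text))
```

If you don't know the encoding, this won't tell you what it is; it only shows that the failure comes from the codec choice rather than from open itself.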

In Python 2, no attempt was made to decode non-ASCII files; reading from a file returned a str value consisting of whatever bytes were actually stored in the file.

This is part of the overall shift in Python 3 from using the old str type as sometimes text, sometimes bytes, to using str exclusively for Unicode text and bytes for any particular encoding of the text.
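If what you want is the Python 2 behaviour of getting the stored bytes back with no decoding, the closest Python 3 equivalent is probably binary mode. A rough sketch, using made-up file contents and assuming cp1252 for the later decode step:

```python
import os
import tempfile

# Hypothetical non-UTF-8 file contents.
raw = b"caf\xe9 \x93quoted\x94\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: nothing is decoded, so no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        data = f.read()

# data is a bytes object holding exactly what was stored in the file.
print(type(data).__name__)

# Decoding is now an explicit, separate step once the encoding is known.
# cp1252 is an assumption about this particular data.
text = data.decode("cp1252")
print(ascii(text))
```

The trade-off is that everything downstream has to work with bytes (or decode them itself), which is the distinction Python 3 is forcing you to make.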

edited Feb 04 '22 at 15:18

answered Feb 04 '22 at 15:12

chepner


So python2 did not do any default encoding to go from the file reading to str? – Panda Feb 04 '22 at 15:44
No. Basically, the `unicode` type corresponded to what we now use `str` for, and the `str` type was used indiscriminately for data which we would now pick `str` or `bytes` to represent. A big change in Python 3 was to force a distinction between the two. A file opened in text mode has an encoding associated with it, and reading from the file will always (try to) decode the bytes read to give you a `str`. A file opened in binary mode and reading from it will always return a `bytes` value (which you may or may not be able to decode to give you a `str` value). – chepner Feb 04 '22 at 15:54
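A small sketch of the text-mode versus binary-mode difference described in that comment, also showing what errors='replace' from the question does to an undecodable byte. The file contents here are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.txt")
    with open(path, "wb") as f:
        f.write(b"ok \x93 bad byte\n")  # hypothetical contents

    # Text mode: the file object carries an encoding and read() returns str.
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    with open(path, encoding="utf-8", errors="replace") as f:
        text_encoding = f.encoding
        text = f.read()

    # Binary mode: read() returns the stored bytes untouched.
    with open(path, "rb") as f:
        data = f.read()

print(text_encoding, type(text).__name__, ascii(text))
print(type(data).__name__, data)
```

Note that the replacement is lossy: once 0x93 has become U+FFFD, the original byte cannot be recovered from the str.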
