I had a script I was trying to port from python2 to python3.

I did read through the porting documentation, https://docs.python.org/3/howto/pyporting.html.

My original python2 script used open('filename.txt'). In porting to python3 I updated it to io.open('filename.txt'). Now, when running the script under either python2 or python3 with the same input files, I get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 793: invalid start byte.

Does python2 open have less strict error checking than io.open or does it use a different default encoding? Does python3 have an equivalent way to call io.open to match python2 built in open?

Currently I've started using f = io.open('filename.txt', mode='r', errors='replace') which works. And comparing output to the original python2 version, no important data was lost from the replace errors.

Panda
  • Just search online for the error message to get an idea what may cause it. Make sure you also extract a [mcve] from your code so you know the actual code causing it. – Ulrich Eckhardt Feb 04 '22 at 15:14

1 Answer

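First, in Python 3 io.open is just an alias for the built-in open, so there's no need to stop using open directly. (In Python 2, io.open is the backported Python 3-style open, which is why the error now shows up under both interpreters.)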

The issue is that your call to open is assuming the file is UTF-8-encoded when it is not. You'll have to supply the correct encoding explicitly.

open('filename.txt', encoding='iso-8859-1')  # for example

(Note that the default encoding is platform-specific, but your error indicates that you are, in fact, defaulting to UTF-8.)
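To make that concrete, here is a minimal sketch that reproduces the error and the fix on a throwaway file. The sample bytes and the choice of cp1252 are assumptions for illustration only (0x93 happens to be a curly quote in Windows-1252); your file's real encoding may be something else.

```python
import locale
import os
import tempfile

# Hypothetical sample data: 0x93/0x94 are curly quotes in Windows-1252,
# but they are not valid UTF-8 start bytes.
raw = b"He said \x93hello\x94\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # The encoding open() falls back to when none is given (platform-specific).
    default_encoding = locale.getpreferredencoding(False)

    # Decoding as UTF-8 fails, as in the question.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Supplying the (assumed) real encoding succeeds.
    with open(path, encoding="cp1252") as f:
        text = f.read()
finally:
    os.remove(path)

print(default_encoding, utf8_failed, repr(text))
```

Passing encoding explicitly also keeps the script's behaviour the same on every platform, instead of depending on whatever the locale default happens to be.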

In Python 2, the built-in open made no attempt to decode file contents at all; reading from a file returned a str value consisting of whatever bytes were actually stored in the file.

This is part of the overall shift in Python 3 from using the old str type as sometimes text, sometimes bytes, to using str exclusively for Unicode text and bytes for any particular encoding of the text.
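If you really want the old "just give me the bytes" behaviour, something like the following sketch gets close. The sample bytes are made up for illustration; binary mode is the nearest match to Python 2's open, and latin-1 is shown only because it maps every byte to a code point and so never raises, not because it is necessarily the right encoding for your data.

```python
import os
import tempfile

# Hypothetical non-UTF-8 content for illustration.
raw = b"caf\xe9 \x93quoted\x94\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: no decoding happens, you get bytes back
    # (the closest equivalent of Python 2's str from open()).
    with open(path, "rb") as f:
        data = f.read()

    # Text mode with latin-1: every byte maps to exactly one code point,
    # so decoding never fails, but the text is only meaningful if the
    # file really is latin-1. newline="" disables newline translation.
    with open(path, encoding="latin-1", newline="") as f:
        text = f.read()
finally:
    os.remove(path)

# Encoding back with latin-1 recovers the original bytes unchanged.
round_trip = text.encode("latin-1")
print(type(data).__name__, type(text).__name__, round_trip == raw)
```

Unlike errors='replace', neither approach throws information away, so you can still decode properly later once you know the real encoding.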

chepner
  • So python2 did not do any default encoding to go from the file reading to str? – Panda Feb 04 '22 at 15:44
  • No. Basically, the `unicode` type corresponded to what we now use `str` for, and the `str` type was used indiscriminately for data which we would now pick `str` or `bytes` to represent. A big change in Python 3 was to force a distinction between the two. A file opened in text mode has an encoding associated with it, and reading from the file will always (try to) decode the bytes read to give you a `str`. Reading from a file opened in binary mode will always return a `bytes` value (which you may or may not be able to decode to give you a `str` value). – chepner Feb 04 '22 at 15:54