Porting from Python 2 to Python 3: 'utf-8 codec can't decode byte'

Question

Hey, I tried to port this little snippet from Python 2 to Python 3.

Python 2:

def _download_database(self, url):
  try:
    with closing(urllib.urlopen(url)) as u:
      return StringIO(u.read())
  except IOError:
    self.__show_exception(sys.exc_info())
  return None

Python 3:

def _download_database(self, url):
  try:
    with closing(urllib.request.urlopen(url)) as u:
      response = u.read().decode('utf-8')
      return StringIO(response)
  except IOError:
    self.__show_exception(sys.exc_info())
  return None

But I'm still getting

utf-8 codec can't decode byte 0x8f in position 12: invalid start byte

I need to use StringIO since it's a zip file and I want to parse it with this function:

def _parse_zip(self, raw_zip):
  try:
    zip = zipfile.ZipFile(raw_zip)

    filelist = map(lambda x: x.filename, zip.filelist)
    db_file  = 'IpToCountry.csv' if 'IpToCountry.csv' in filelist else filelist[0]

    with closing(StringIO(zip.read(db_file))) as raw_database:
      return_val = self.___parse_database(raw_database)

    if return_val:
      self._load_data()

  except:
    self.__show_exception(sys.exc_info())
    return_val = False

  return return_val

raw_zip is the return value of _download_database.

The encoding of the data you received is, apparently, *not* UTF-8. What encoding is it? If the web server is correct, then the `Content-Type` header of the HTTP response should tell you, as well as potentially an HTML `<meta>` tag in the document (if it is HTML). — dsh, Dec 17 '15 at 14:00
[Here](https://stackoverflow.com/search?q=[python-3]+codec+can%27t+decode+answers%3A1) are existing questions on StackOverflow with answers explaining decoding bytes to characters. — dsh, Dec 17 '15 at 15:24
That url is downloading a zip file -- why are you trying to convert a binary file into a string? — Ethan Furman, Dec 17 '15 at 18:05
I added an answer for what I'm doing next with that. It worked in Python 2, but I don't know why it doesn't in Python 3. — Fragkiller, Dec 17 '15 at 18:11
Python 2 does not have a clear separation between `binary` and `text`, but Python 3 does: binary data is `bytes` and text is `str`. `raw_zip` should be binary, so you don't need to decode it. — Ethan Furman, Dec 17 '15 at 18:15
@Fragkiller As others have noted, you are retrieving a binary file, not text. Ashley Wilson's answer shows how to get the bytes. The reason it "worked" in Python 2 is because Python 2 was very sloppy regarding bytes versus characters and didn't handle Unicode well. In Python 3 you need to understand the difference. — dsh, Dec 17 '15 at 18:16

score 5 · Accepted Answer · edited May 23 '17 at 12:08

utf-8 can't decode arbitrary binary data.

utf-8 is a character encoding: it encodes text (in Python 3, the str type, a sequence of Unicode code points) into a bytestring (the bytes type, a sequence of small integers in the interval [0, 255]) and decodes it back.
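
As a minimal sketch of that round trip (the sample string and the byte sequence are arbitrary illustrations, not data from the question): encoding turns a str into bytes, decoding reverses it, and a byte such as 0x8f, which cannot start a UTF-8 sequence, reproduces the kind of error shown in the question.

```python
text = "café"

# str -> bytes: each code point becomes one or more bytes.
encoded = text.encode("utf-8")

# bytes -> str: the inverse operation.
decoded = encoded.decode("utf-8")

# Non-ASCII characters need more than one byte in UTF-8,
# so the two lengths differ.
lengths = (len(text), len(encoded))

# Arbitrary binary data is usually not valid UTF-8. The first four
# bytes here are the zip local-file signature; 0x8f is a continuation
# byte and cannot begin a sequence.
try:
    b"PK\x03\x04\x8f".decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason
```

Here decoded equals text again, while error_reason holds the same "invalid start byte" message that appears in the traceback.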

utf-8 is not the only character encoding, and some encodings are incompatible with it. Even if .decode('utf-8') does not raise an exception, the result is not necessarily correct: you may get mojibake if you decode text with the wrong character encoding. See "A good way to get the charset/encoding of an HTTP response in Python".
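
A small sketch of both points, with made-up sample data: decoding with the wrong codec can succeed silently and still produce garbage, and the declared charset can be read from a Content-Type header. The header value below is an assumption for illustration. With urlopen(), the response's headers object offers the same get_content_charset() method.

```python
from email.message import Message

original = "café"
data = original.encode("utf-8")

# latin-1 assigns a character to every byte value, so this never
# raises, but the result is not the text that was encoded.
wrong = data.decode("latin-1")
right = data.decode("utf-8")

# Reading the charset a server declares. The header value here is a
# hypothetical example, not taken from the question's server.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
declared = headers.get_content_charset(failobj="utf-8")
```

The value in wrong has five characters instead of four: the two bytes of the encoded "é" were each read as a separate latin-1 character.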

Your input is a zip file, which is binary data, not text, so you should not try to decode it to text.

Python 3 helps you find errors related to mixing binary data and text. To port code from Python 2 to Python 3, you should understand the distinction between text (Unicode) and binary data (bytes).

On Python 2, str is a bytestring that can hold binary data or (encoded) text. A '' literal creates a bytestring unless from __future__ import unicode_literals is present; u'' creates a unicode instance. On Python 3, the str type is Unicode. bytes refers to a sequence of bytes on both Python 3 and Python 2.7 (where bytes is an alias for str), and b'' creates a bytes instance on both.
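
On Python 3 the separation can be checked directly. This is only an illustrative sketch with arbitrary sample values:

```python
text = "abc"     # str: a sequence of Unicode code points
data = b"abc"    # bytes: a sequence of integers in range(256)

kinds = (type(text).__name__, type(data).__name__)

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_items = (text[0], data[0])

# str and bytes never compare equal, even with the "same" content.
equal = text == data

# Mixing the two types is an error rather than an implicit conversion.
try:
    text + data
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__
```

On Python 2 the concatenation would succeed silently, which is how code that confuses bytes and text can appear to work until it is ported.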

urllib.request.urlopen(url) returns a binary file-like object. In some cases you can pass it along as is, e.g., to decompress remote gzipped content on the fly:

#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
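    # getelements() and process() are placeholders for your own
    # element-iterating and element-handling functions.
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)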

ZipFile() requires a seek()-able file, so you can't pass the urlopen() response to it directly. You have to download the content first; you can wrap it in io.BytesIO():

#!/usr/bin/env python3
import io
import zipfile
from urllib.request import urlopen

url = "http://www.pythonchallenge.com/pc/def/channel.zip"
with urlopen(url) as r, zipfile.ZipFile(io.BytesIO(r.read())) as archive:
    print({member.filename: archive.read(member) for member in archive.infolist()})

StringIO() is a text file: in Python 3 it stores Unicode, not bytes.
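
Applied to the question's _parse_zip, a network-free sketch might look like the following. The helper names build_sample_zip and read_database_text, the sample archive contents, and the latin-1 default for the CSV member are all assumptions made for illustration; the real file's encoding would need to be checked. The archive stays in a BytesIO, and only the extracted text member is decoded.

```python
import io
import zipfile


def build_sample_zip():
    """Return the bytes of a tiny zip archive (stand-in for a download)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("IpToCountry.csv", "1,2,ZZ\n")
    return buf.getvalue()


def read_database_text(raw_zip, preferred="IpToCountry.csv", encoding="latin-1"):
    """Open a seekable binary file as a zip and return one member as text."""
    with zipfile.ZipFile(raw_zip) as archive:
        # namelist() returns a real list, so indexing works on Python 3
        # (map() returns an iterator there).
        names = archive.namelist()
        db_file = preferred if preferred in names else names[0]
        # archive.read() returns bytes; only this step decodes to str.
        return io.StringIO(archive.read(db_file).decode(encoding))


payload = build_sample_zip()  # bytes, like the result of u.read()

# StringIO only accepts str, so it cannot hold the archive itself.
try:
    io.StringIO(payload)
    stringio_error = None
except TypeError as exc:
    stringio_error = type(exc).__name__

database = read_database_text(io.BytesIO(payload))
first_line = database.readline()
```

With a real download, io.BytesIO(u.read()) would take the place of io.BytesIO(payload).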

score 3 · Answer 2 · answered Dec 17 '15 at 14:24

If all you need is to return a stream object from your function (rather than decoded content), you can use BytesIO instead of StringIO:

from contextlib import closing
from io import BytesIO
from urllib.request import urlopen

url = 'http://www.google.com'


with closing(urlopen(url)) as u:
    response = u.read()
    print(BytesIO(response))

score 1 · Answer 3 · Ethan Furman · 2015-12-17T18:15:44.490

The link you posted, http://software77.net/geo-ip?DL=2, downloads a zip file, which is binary.

You shouldn't convert a binary blob to a str (just use BytesIO).
If you have a really good reason to do so anyway, use latin-1 as the decoder.
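
The reason latin-1 works as a fallback is that it maps each of the 256 byte values to the code point with the same number, so the conversion never fails and loses nothing. A short sketch with an arbitrary blob:

```python
# Every possible byte value, as a stand-in for arbitrary binary data.
blob = bytes(range(256))

# Decoding cannot fail: byte N becomes code point U+00NN.
as_text = blob.decode("latin-1")

# Encoding with the same codec restores the original bytes exactly.
round_tripped = as_text.encode("latin-1")
code_points = [ord(ch) for ch in as_text]
```

The resulting str is only a carrier for the bytes; it is not meaningful text.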

edited Dec 17 '15 at 18:15

answered Dec 17 '15 at 18:07

