22

I'm getting really tired of trying to figure out why this code works in Python 2 and not in Python 3. I'm just trying to grab a page of json and then parse it. Here's the code in Python 2:

import urllib, json
response = urllib.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)

I thought the equivalent code in Python 3 would be this:

import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)

But it blows up in my face, because the data returned by read() is a "bytes" type. However, I cannot for the life of me get it to convert to something that json will be able to parse. I know from the headers that reddit is trying to send utf-8 back to me, but I can't seem to get the bytes to decode into utf-8:

import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content.decode("utf8"))

What am I doing wrong?

Edit: the problem is that I cannot get the data into a usable state; even though json loads the data, part of it is undisplayable, and I want to be able to print the data to the screen.

Second edit: The problem has more to do with print than parsing, it seems. Alex's answer provides a way for the script to work in Python 3, by setting the IO to utf8. But a question still remains: why is it that the code worked in Python 2, but not Python 3?

dreftymac
Dan Lew

4 Answers

15

The code you posted presumably suffers from cut-and-paste mistakes, because it's clearly wrong in both versions (f.read() fails because there's no barename f defined).

In Py3, ur = response.decode('utf8') works perfectly well for me, as does the subsequent json.loads(ur). Maybe the faulty copy-and-paste affected your 2-to-3 conversion attempts.
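For reference, here is a minimal sketch of the decode-then-parse step that runs without touching the network. The `content` bytes are a made-up stand-in for what `response.read()` would return, not actual reddit output.

```python
import json

# Hypothetical stand-in for response.read(): raw UTF-8 bytes, no network needed.
content = b'{"kind": "Listing", "title": "caf\xc3\xa9"}'

text = content.decode("utf-8")  # bytes -> str
data = json.loads(text)         # str -> Python objects

# ascii() escapes non-ASCII characters, so this prints on any terminal.
print(type(content).__name__, "->", type(text).__name__)
print(ascii(data["title"]))
```

If this works but a plain print(data) fails, the parsing is fine and the problem lies in the output encoding.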

Alex Martelli
  • Whoops, I will fix the code mistakes... I tried reformatting it for display but screwed it all up in the process. :P Regardless, I can't view the data after I parse it (using a simple "print(data)") because it gives me charmap errors. – Dan Lew Jun 28 '10 at 00:08
  • @Daniel, the problems _after_ you've gotten the data seem to be a separate question from this one about getting the data (which my answer, it appears, responded to -- though seemingly you don't agree, since you didn't even upvote it!). If by `data` you mean the `json.loads(response)`, I can `print` it without any problem (on my Mac Terminal.app, which supports UTF-8). What's your sys.stdout.encoding? Have you set properly the environment variable `PYTHONIOENCODING: Encoding[:errors] used for stdin/stdout/stderr` before starting Python 3? Etc, etc -- totally different issues, see. – Alex Martelli Jun 28 '10 at 01:26
  • Sorry if I was unclear at first. The core problem is I can't *use* the data after parsing, for whatever reason (the print is just the beginning of it; if I can't print it, then somewhere down the line I'm going to run into trouble reading the data). I'll check out the encoding, suffice to say it doesn't work on my W7 machine. – Dan Lew Jun 28 '10 at 13:17
  • @Daniel, if you can't print it, it's perfectly possible that the problem has nothing to do with anything else _except_ the output capability of your Windows terminal -- as http://en.wikipedia.org/wiki/Code_page says, "Most well-known code pages [...] fit all their code-points into 8 bits and do not involve anything more than mapping each code-point to a single bitmap", meaning they just can't show most Unicode characters. This would not stop you from using your data in any other way -- and we could discuss Unicode woes on Windows **much** better in a Q & A rather than cramped in comments! – Alex Martelli Jun 28 '10 at 13:58
  • If it were just the output capability of the Windows terminal, then why does the code work in Python 2? – Dan Lew Jun 28 '10 at 14:16
  • @Daniel, perhaps by a different setting of sys.stdout.encoding (e.g. via `PYTHONIOENCODING`, etc) -- I've already asked about that and I've heard nothing from you in response in this interminable thread of comments you insist on perpetuating. Why not just `print(repr(data))` in both cases and check if anything is different? If not, then you **know** it's all about output/terminal issues, as I suspect it may well be -- if specific differences, then of course let us know (editing your Q please, **not** in yet another cramped comment!-). – Alex Martelli Jun 28 '10 at 14:32
  • I can't test the code at the moment anyways because reddit itself is down; once I can I'll edit the question with details. I do know that the sys.stdout.encoding is the same between my 2.6 and 3.1 instances (cp437, which I could try setting to something else). – Dan Lew Jun 28 '10 at 14:40
  • @Daniel, CP437 (like most CPs) just won't let you show every Unicode character (a tiny subset, in fact). Type into the Windows console "chcp 65001" (this sets the code page to UTF-8) and change the terminal font to a Unicode font: Right click title bar, Properties, Font, Lucida Console; then `SET PYTHONIOENCODING=utf8`. – Alex Martelli Jun 28 '10 at 15:14
  • The PYTHONIOENCODING solved the problem, but I still want to know why it worked in P2 but not P3. – Dan Lew Jun 29 '10 at 15:10
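The comment thread above can be reproduced without a Windows console. This is a sketch under assumptions: `safe_text` is a hypothetical helper and the sample `data` is made up. It shows that a code page such as cp437 cannot encode many Unicode characters, and that an escaped representation always can be printed. A likely explanation for the 2-vs-3 difference is that Python 2's repr() of unicode strings inside a dict already escaped non-ASCII characters (much like ascii() does here), whereas Python 3's repr() keeps them as real characters that the terminal encoding must then handle.

```python
import sys


def safe_text(value, encoding=None):
    """Return repr(value) if the encoding can represent it, else an escaped form."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    text = repr(value)
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        # ascii() escapes every non-ASCII character as \x, \u or \U sequences.
        return ascii(value)


# Made-up sample containing characters that cp437 has no mapping for.
data = {"title": "It\u2019s \u2603"}

try:
    repr(data).encode("cp437")
    fits_cp437 = True
except UnicodeEncodeError:
    fits_cp437 = False

print(fits_cp437)
print(safe_text(data, "cp437"))
```

Setting PYTHONIOENCODING, as suggested above, changes what sys.stdout.encoding reports, so the first branch of `safe_text` would then succeed for this data.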
7

Depending on your Python version, you have to choose the correct library.

For Python 3.5:

import urllib.request
data = urllib.request.urlopen(url).read().decode('utf8')

For Python 2.7:

import urllib
# serviceurl and address are assumed to be defined earlier
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
uh = urllib.urlopen(url)
data = uh.read()
0

Here is an approach that is compatible across both versions - it works by first converting bytes data to string, and then loading the string.

import json
try:
    from urllib.request import Request, urlopen #python3+
except ImportError:
    from urllib2 import Request, urlopen        #python2

url = 'https://jsonfeed.org/feed.json'
request = Request(url)
response_json_string = urlopen(request).read().decode('utf8')
response_json_object = json.loads(response_json_string)
Eric Jarvi
0

Please see that answer in another Unicode related question.

Now: the Python 3 str type (which was the Python 2 unicode type) is an idealised object, in the sense that it deals with “characters”, not “bytes”. To be written to or read from disk or network data, these characters need to be encoded into, or decoded from, bytes by a “conversion table”, a.k.a. an encoding, a.k.a. a codepage. Because operating systems vary, Python historically avoided guessing what that encoding should be; this has been changing over the years, but the principle of “In the face of ambiguity, refuse the temptation to guess.” still applies.
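A small illustration of that characters-versus-bytes distinction, using a made-up sample string: the same str becomes different bytes under different encodings, and decoding with the wrong one silently yields garbage.

```python
text = "caf\u00e9"  # four characters; the last one is non-ASCII

utf8_bytes = text.encode("utf-8")      # the last character takes two bytes
latin1_bytes = text.encode("latin-1")  # the last character takes one byte

roundtrip = utf8_bytes.decode("utf-8")   # correct codec: original text back
mojibake = utf8_bytes.decode("latin-1")  # wrong codec: no error, wrong text

print(utf8_bytes, latin1_bytes)
print(ascii(roundtrip), ascii(mojibake))
```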

Thankfully, a web server makes your work easier. Your response above should give you all extra information needed:

>>> response.headers['content-type']
'application/json; charset=UTF-8'

So, every time you issue a request to a web server, check the Content-Type header for a charset value, and decode the response's data into Unicode (Python 3: bytes.decode(charset) → str) using that charset.
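As a sketch of that advice, the helper below reads the charset from a Content-Type value and decodes accordingly. `decode_body` and its utf-8 fallback are assumptions made for illustration; the header parsing uses the standard library's email.message.Message, which is also the base class of the headers object on a Python 3 urllib response, so `response.headers.get_content_charset()` should be available there directly.

```python
import json
from email.message import Message


def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode raw_bytes using the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return raw_bytes.decode(charset)


# Made-up body standing in for response.read(); the header is the one shown above.
body = b'{"name": "caf\xc3\xa9"}'
text = decode_body(body, "application/json; charset=UTF-8")
data = json.loads(text)
print(ascii(data))
```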

tzot