41

Essentially, I made a request to a website and got a bytes response back: b'[{"geonameId":"703448"}..........'. I'm confused because, although it is of type bytes, it is very human-readable and looks like a list of JSON objects. I know the response is encoded in Latin-1, because r.encoding returned ISO-8859-1, and I have tried to decode it, but it just returns an empty string. Here's what I have so far:

r = response.content
string = r.decode("ISO-8859-1")
print (string)

and this is where it prints a blank line. However when I run

len(string)

I get back 31023. How can I decode these bytes without getting back an empty string?

koda gates
  • In Python 2.x the b prefix gives you a plain `str`, so you may already have some encoded characters hidden somewhere within. On Python 3.x you will get a `bytes` literal. Why do you believe you need to perform any encoding/decoding? – Mike McMahon Jul 29 '15 at 18:44
  • Because I need to parse the JSON. I just tried looping over it with `for i in range(len(content)): print content[i]` and it's just printing out lots of numbers. – koda gates Jul 29 '15 at 18:50
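The "lots of numbers" in the last comment is what Python 3 does when you index or iterate over bytes: each item is an integer byte value. A minimal sketch of the difference, assuming a small made-up payload (the real response body is elided in the question) and assuming it is UTF-8:

```python
import json

# Hypothetical stand-in for response.content; not the real response body.
raw = b'[{"geonameId": "703448"}]'

# Indexing bytes yields integers (the byte values), not characters.
first_values = [raw[i] for i in range(3)]

# Decoding first gives a str, whose items are one-character strings.
text = raw.decode("utf-8")
first_chars = [text[i] for i in range(3)]

# Once the body is a str, the json module can parse it.
records = json.loads(text)

print(first_values)
print(first_chars)
print(records[0]["geonameId"])
```

So the loop has to run over the parsed result (or at least the decoded string), not over the raw bytes.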

4 Answers

37

Did you try to parse it with the json module?

import json
parsed = json.loads(response.content)
mzc
  • Yes and I got: `JSON object must be str, not 'bytes'` – koda gates Jul 29 '15 at 19:04
  • And when you do `json.loads(response.content.decode('latin1'))`? – mzc Jul 29 '15 at 19:14
  • There should be a header in the response object telling you what encoding it has. You should decode the content with that codec, otherwise any unusual characters (emoji, accents, some quote characters, ...) will end up garbled. See the Answer from @salah – drevicko Mar 01 '17 at 10:28
  • @mzc Please add the `content.decode` comment directly to the answer. – Martin Thoma Oct 10 '18 at 09:41
  • @mzc, decode('latin1') doesn't always work; if the content type is `text/html; charset=UTF-8`, it fails. – Anu Oct 14 '19 at 19:14
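To illustrate the point raised in the comments about picking the right codec: decoding UTF-8 bytes as Latin-1 does not raise an error, it silently garbles any non-ASCII characters. A small sketch, assuming a made-up UTF-8 JSON body (the field name and value are hypothetical):

```python
import json

# Hypothetical UTF-8 encoded body; "é" takes two bytes in UTF-8.
body = '[{"name": "café"}]'.encode("utf-8")

# Decoding with the codec the server actually used round-trips cleanly.
correct = json.loads(body.decode("utf-8"))

# Latin-1 maps every byte to a character, so this never raises,
# but each byte of the two-byte "é" becomes its own character.
garbled = json.loads(body.decode("latin1"))

print(correct[0]["name"])
print(garbled[0]["name"])
```

For pure-ASCII content both decodings give the same result, which is why the Latin-1 approach can appear to work until accented characters show up.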
30

Another solution is to use response.text, which returns the content as Unicode text:

Type:        property
String form: <property object at 0x7f76f8c79db8>
Docstring:  
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using
``chardet``.

The encoding of the response content is determined based solely on HTTP
headers, following RFC 2616 to the letter. If you can take advantage of
non-HTTP knowledge to make a better guess at the encoding, you should
set ``r.encoding`` appropriately before accessing this property.
salah
  • This is a much better idea than the accepted answer, as it will use the appropriate encoding. – drevicko Mar 01 '17 at 10:31
  • Yes, this is what is suggested in the docs: http://docs.python-requests.org/en/master/user/quickstart/#response-content – Jérôme Dec 04 '17 at 12:01
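The docstring's mention of RFC 2616 may explain why r.encoding reported ISO-8859-1 in the question: under that RFC, text/* media types default to ISO-8859-1 when the Content-Type header carries no charset parameter. The following is only a standard-library sketch of that header rule, not the library's own code; the function name charset_per_rfc2616 is hypothetical:

```python
from email.message import Message


def charset_per_rfc2616(content_type):
    """Pick a charset from a Content-Type header value (illustrative only)."""
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lowercased charset parameter, or None
    if declared:
        return declared
    # RFC 2616 default for text/* media types without an explicit charset.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    # Nothing in the header to go on; a caller would have to guess.
    return None


print(charset_per_rfc2616("text/html; charset=UTF-8"))
print(charset_per_rfc2616("text/html"))
print(charset_per_rfc2616("application/json"))
```

When the header-derived value is wrong for a given site, the docstring's advice applies: set r.encoding yourself before reading r.text.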
13

There is r.text and r.content. The first one is a string, the second one is bytes.

You want

import json

data = json.loads(r.text)
Martin Thoma
3

I faced a similar issue using beautifulsoup4 and requests while scraping webpages; however, both response.text and response.content looked like bytes.

The response headers included 'Content-Type': 'text/html; charset=UTF-8' and also 'Content-Encoding': 'br'. It turned out I hadn't installed brotlipy in the environment, and running pip install brotlipy fixed my issue. I thought chardet or cchardet would be enough, but the data first needed to be correctly decompressed.

A similar issue was solved here in the same way; I'm linking to it in this answer since it didn't come up until I explicitly searched for brotli compression.
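The underlying idea is that a compressed body has to be decompressed before any character decoding makes sense. Brotli is not in the standard library, so this sketch swaps in gzip purely as a stand-in to show the same effect; the payload is made up:

```python
import gzip
import json

# Hypothetical payload, compressed the way a server might before sending it
# with a Content-Encoding header (gzip here, standing in for brotli).
payload = '[{"geonameId": "703448"}]'.encode("utf-8")
wire_bytes = gzip.compress(payload)

# Decoding the still-compressed bytes "works" with latin1 but yields binary junk.
looks_like_json = wire_bytes.decode("latin1").startswith("[")

# Decompress first, then decode, then parse.
restored = gzip.decompress(wire_bytes).decode("utf-8")
data = json.loads(restored)

print(looks_like_json)
print(data[0]["geonameId"])
```

No charset detection can recover text from bytes that are still compressed, which matches the observation that chardet alone was not enough.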

KT12