33

I am using the requests library to query the Diffbot API for the contents of an article from a web page URL. When I visit a request URL that I created in my browser, it returns a JSON object with the text in Unicode (right?). For example (I shortened the text somewhat):

{"icon":"http://mexico.cnn.com/images/ico_mobile.jpg","text":"CIUDAD DE MÉXICO (CNNMéxico) \u2014 Kassandra Guazo Cano tiene 32 años, pero este domingo participó por primera vez en una elección.\n\"No había sacado mi (credencial del) IFE (Instituto Federal Electoral) porque al hacer el trámite hay mucha mofa cuando ven que tu nombre no coincide con tu y otros documentos de acuerdo con su nueva identidad.\nSánchez dice que los solicitantes no son discriminados, pero la experiencia de Kassanda es diferente: \"hay que pagar un licenciado, dos peritos (entre ellos un endocrinólogo). Además, el juez dicta sentencia para el cambio de nombre y si no es favorable tienes que esperar otros cuatro años para volver a demandar al registro civil\".\nAnte esta situación, el Consejo para Prevenir y Eliminar la sculina, los transgénero votan - México: Voto 2012 - Nacional","url":"http://mexico.cnn.com/nacional/2012/07/02/con-apariencia-de-mujer-e-identidad-masculina-los-transexuales-votan","xpath":"/HTML[1]/BODY[1]/SECTION[5]/DIV[1]/ARTICLE[1]/DIV[1]/DIV[6]"}

When I use the Python requests library as follows:

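import json
import requests

def get_article(self, params=None):
  # Copy so that neither the caller's dict nor a shared default is mutated.
  params = dict(params or {})
  api_endpoint = 'http://www.diffbot.com/api/article'
  params.update({
    'token': self.dev_token,
    'format': self.output_format,
  })
  req = requests.get(api_endpoint, params=params)
  # req.content is the raw response body (bytes), not decoded text.
  return json.loads(req.content)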

It returns this (again note that I shortened the text somewhat):

{u'url': u'http://mexico.cnn.com/nacional/2012/07/02/con-apariencia-de-mujer-e-identidad-masculina-los-transexuales-votan', u'text': u'CIUDAD DE M\xc9XICO (CNNM\xe9xico) \u2014 Kassandra Guazo Cano tiene 32 a\xf1os, pero este domingo particip\xf3 por primera vez en una elecci\xf3n.\n"No hab\xeda sacado mi (credencial del) IFE (Instituto Federal Electoral) porque al hacOyuky Mart\xednez Col\xedn, tambi\xe9n transg\xe9nero, y que estaba acompa\xf1ada de sus dos hijos y su mam\xe1.\nAmbas trabajan como activistas en el Centro de Apoyo a las Identidades Trans, A.C., donde participan en una campa\xf1a de prevenci\xf3n de enfermedades sexuales.\n"Quisi\xe9ramos que no solo nos vean como trabajadoras sexuales o estilistas, sino que luchamos por nuestros derechos", dice Kassandra mientras sonr\xede, sostiene su credencial de elector y levanta su pulgar entintado.', u'title': u'Con apariencia de mujer e identidad masculina, los transg\xe9nero votan - M\xe9xico: Voto 2012 - Nacional', u'xpath': u'/HTML[1]/BODY[1]/SECTION[5]/DIV[1]/ARTICLE[1]/DIV[1]/DIV[6]', u'icon': u'http://mexico.cnn.com/images/ico_mobile.jpg'}

I don't quite understand Unicode. How can I make sure that what I get with requests is still Unicode?

Jeremy
Javaaaa
  • Looks like you have unicode strings in that json result. Notice the "u'...'" notation? You can also check the type of some of the result: `type(result['text'])`. http://docs.python.org/howto/unicode.html – istruble Jul 11 '12 at 15:04
  • Thanks! I see it is unicode indeed with the u'', however it says prevenci\xf3n (when using requests) instead of preferiría (in browser) for example. How can I make it so that prevenci\xf3n is preferiría? – Javaaaa Jul 11 '12 at 15:56
  • That's just plain old string literal syntax. Python shows you `\xc9` because that's safe to print on all consoles, whereas `É` would fail on consoles that don't support Unicode properly. If your console is working, you can see they are the same. `>>> u'CIUDAD DE M\xc9XICO'==u'CIUDAD DE MÉXICO'` is True. – bobince Jul 11 '12 at 21:36
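The point in the last comment can be checked directly. This is a minimal sketch, assuming Python 3, where every `str` is Unicode and the built-in `ascii()` gives an escaped view similar to what the Python 2 console showed for `u'...'` strings. The variable names are only illustrative.

```python
# The same text written two ways: with an escape sequence and with the character itself.
escaped = "CIUDAD DE M\xc9XICO"
literal = "CIUDAD DE MÉXICO"

# Both spellings produce the identical string object contents.
same = escaped == literal

# ascii() returns an ASCII-only representation that escapes non-ASCII characters.
escaped_view = ascii(literal)

print(same)
print(escaped_view)
```

The escape sequence is only a way of displaying or typing the character; the data in the string is unchanged.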

3 Answers

43

You can use req.text instead of req.content to ensure that you get Unicode. This is described in:

https://requests.readthedocs.io/en/latest/api/#requests.Response.text
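A rough offline sketch of the difference, assuming Python 3 and a UTF-8 encoded body. Here `raw` stands in for what `req.content` would hold (bytes) and `text` for what `req.text` would hold (a decoded string); both names and the sample body are made up for illustration, and no request is actually sent.

```python
import json

# Stand-in for req.content: the raw UTF-8 bytes of a JSON body.
raw = '{"text": "CIUDAD DE MÉXICO \\u2014 elección"}'.encode("utf-8")

# Stand-in for req.text: the same body decoded into a Unicode string.
text = raw.decode("utf-8")

# Parsing the decoded text yields a dict whose string values are Unicode.
data = json.loads(text)
```

With a live response, `req.json()` does the decoding and parsing in one step.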

TTT
  • is there a way to not get a unicode JSON response with Requests? It seems like the only way to print out a JSON response to terminal without the unicode is to print the JSON as a string, but then of course it's not a data structure anymore, just a string. Is there a way to deal with Request responses as "pure" JSON still as a data structure but with no Unicode? – AdjunctProfessorFalcon Jul 17 '15 at 19:57
  • NB: In some instances it is necessary to do `response.content.decode('utf-8')` to decode the raw UTF-8 bytes into a Unicode string. – Jay Taylor Sep 16 '16 at 20:01
15

Concerning the "I don't quite understand unicode", there's an entertaining primer on Unicode by Joel Spolsky and the official Python Unicode HowTo, which is a 10-minute read and covers everything Python-specific.

The requests docs say that requests will always return Unicode for the response text, and the example content you posted is in fact Unicode (notice the u'' string syntax? That's Python's syntax for Unicode strings), so there's no problem. Note that if you view the JSON response in a web browser, the u'' will not be there because it is only Python's notation for the string, not part of the data.

If Unicode is important to your application, please don't try to cope without really knowing about Unicode. You're in for a world of pain: character set issues are extremely frustrating to debug if you don't know what you're doing. Reading both articles mentioned above takes maybe half an hour.

Simon
  • Add to those links Ned Batchelder's excellent presentation on Pragmatic Unicode in Python: http://nedbatchelder.com/text/unipain.html – bgporter Jul 11 '12 at 15:10
  • Thanks! I see it is unicode indeed with the u'', however it says prevenci\xf3n (when using requests) instead of preferiría (in browser) for example. How can I make it so that prevenci\xf3n is preferiría? – Javaaaa Jul 11 '12 at 15:50
  • Python uses an escape sequence like `\xf3` for every non-ASCII character when *displaying* a unicode string. Looking at the Unicode charts at http://www.unicode.org/charts/, you'll see it's an "ó", so it's all right. If you want to see the actual character, you'll need to encode the unicode string. – Simon Jul 11 '12 at 16:01
1

Try response.content.decode('utf-8') if response.text doesn't work.

According to the documentation, the main problem is that the encoding guessed by requests is determined based solely on the HTTP headers. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you can set response.encoding before accessing response.text.
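To illustrate why setting the encoding first matters, here is a small offline sketch, assuming Python 3. `FakeResponse` is a hypothetical stand-in written for this example, not the real `requests.Response`; it only mimics the three attributes discussed here (`content`, `encoding`, `text`) so the effect can be seen without a network call.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response (illustration only)."""

    def __init__(self, content, encoding):
        self.content = content      # raw body bytes
        self.encoding = encoding    # codec guessed from the HTTP headers

    @property
    def text(self):
        # Decode with whatever encoding is set at the moment of access.
        return self.content.decode(self.encoding, errors="replace")


body = "CIUDAD DE MÉXICO".encode("utf-8")

# Assume the headers led to a wrong guess for a body that is really UTF-8.
response = FakeResponse(body, encoding="ISO-8859-1")
garbled = response.text

# Apply outside knowledge about the encoding before reading .text again.
response.encoding = "utf-8"
fixed = response.text
```

With a real response the same two steps apply: assign `response.encoding`, then read `response.text`.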

Credit goes to Jay Taylor for commenting on TTT's answer - I almost missed the comment and thought it deserved its own answer.

stephentgrammer