I am using requests to fetch data from a REST API. The problem is that when I try to convert the response to UTF-8, some characters come out broken, e.g.,

if I use response.text I get

response.text = {"description":"Golden Cã£o Mb Adulto 3kg F"}

if I use response.content I get

response.content = {"description":"Golden C\xc3\xa3\xc2\xa3o Mb Adulto 3kg F"}

I tried changing the response encoding with response.encoding = 'utf-8', response.encoding = 'latin-1', and many others before reading response.text. I also tried response.content.decode('utf-8') and other decodings. In every case I get {"description":"Golden Cã£o Mb Adulto 3kg F"}

The response contains many other fields. If I use response.text.encode('latin-1').decode('utf-8') I can fix some of the broken characters, but for the example above I get this error

{UnicodeDecodeError}'utf-8' codec can't decode bytes ...: invalid continuation byte
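
For reference, the failure can be reproduced without the API. The sketch below assumes the text quoted above is exactly what `response.text` returned; the variable names are only illustrative. It shows that the latin-1/UTF-8 round trip repairs ordinary mojibake but cannot repair this particular string.

```python
# Text as returned by response.text in the question
# ("\u00e3" is a with tilde, "\u00a3" is the pound sign).
broken = "Golden C\u00e3\u00a3o Mb Adulto 3kg F"

# Ordinary mojibake (UTF-8 bytes read as latin-1) is reversible:
# "C\u00c3\u00a3o" encodes to the bytes 43 C3 A3 6F, which are valid UTF-8.
repaired = "C\u00c3\u00a3o".encode("latin-1").decode("utf-8")

# The string from the question is different: latin-1 turns it into
# ... 43 E3 A3 6F ..., and E3 opens a three-byte UTF-8 sequence that
# the following "o" (6F) cannot complete.
error_reason = None
try:
    broken.encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason
    print(exc)
```

So the round trip only helps for text that was valid UTF-8 misread as latin-1, which is not the case here.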

I tried a lot of other things, but I could not fix this. I need some help.


Edit: Server's response headers

{
  'Content-Type': 'application/json',
  'Vary': 'Accept-Encoding',
  'Content-Encoding': 'gzip',
  'Content-Length': '1603',
  'Connection': 'close'
}

For the example above the result should be

{"description":"Golden Cão Mb Adulto 3kg F"}

EDIT: Solved. It turns out the error was on the server side. The server was corrupting some characters when saving them.
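
One way to confirm that the payload itself carries the stray character is to decode the raw bytes and name each non-ASCII character. This is a minimal sketch using the bytes quoted from `response.content` above; it uses only the standard library and makes no request.

```python
import unicodedata

# Raw bytes copied from response.content in the question.
raw = b"Golden C\xc3\xa3\xc2\xa3o Mb Adulto 3kg F"

# The payload is well-formed UTF-8, so a strict decode succeeds.
text = raw.decode("utf-8")

# Name every non-ASCII character to see what the server actually sent.
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]
print(names)
```

The bytes decode cleanly and contain a genuine pound sign after the `ã`, which is consistent with the data having been stored incorrectly on the server rather than decoded incorrectly on the client.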

bastelflp
Rafael
  • Does the request work when you issue it in a browser? In other words, does the data look all-right? What are the server's response headers? – Tomalak Jun 09 '17 at 15:59
  • I added the response headers to my question. There is a Swagger interface that I use for small queries, and it works there. – Rafael Jun 09 '17 at 16:34
  • There were 3 questions in my comment and you managed to ignore two of them. – Tomalak Jun 12 '17 at 09:39
  • Sorry. Does the request work when you issue it in a browser? In other words, does the data look all right? Yes and yes. What are the server's response headers? Added to the post. – Rafael Jun 12 '17 at 11:46
  • Okay, I just wanted to make sure that the server does not deliver broken data. Can you also post what the data is supposed to look like when it's not broken? – Tomalak Jun 12 '17 at 12:05
  • For the example I gave, it should be ` {"description":"Golden Cão Mb Adulto 3kg F"} `. But I also notice decoding errors elsewhere: `í` is decoded as `í-`, `ç` as `ç`, `Á` as `Ã` – Rafael Jun 12 '17 at 12:24
  • Yes, that's because of incorrectly decoding the incoming bytes. `\xc3\xa3\xc2\xa3` is a very curious byte sequence. `C3 A3` is UTF-8 for [LATIN SMALL LETTER A WITH TILDE](http://www.fileformat.info/info/unicode/char/e3/index.htm), which seems correct. But `C2 A3` is nothing useful, really. Strange. (Well, it is the [POUND SIGN](http://www.fileformat.info/info/unicode/char/a3/index.htm), so the value `requests.text` is giving you is correct, but the fact that you say there is no pound sign in the browser is strange.) Is the URL you work with publicly available so I can test it from my end? – Tomalak Jun 12 '17 at 12:33
  • Sorry it is not public. But I noticed that too. Then I found [this article](http://www.i18nqa.com/debug/utf8-debug.html). There is a table mapping `ã£` as `ã` – Rafael Jun 12 '17 at 13:06
  • I think I fixed it. After your comment I realized that the first sequence is the character I want and the second is "garbage". The second sequence is sometimes not recognized as UTF-8, so I just ignored those. `response.text.encode('latin-1').decode('utf8','ignore')` seems to work. – Rafael Jun 12 '17 at 13:25
  • I'm not really sure that that is the solution because the browser does not show garbage. It seems pretty "hacky" to me, I'd not be so sure that you don't create a bunch of new errors this way. – Tomalak Jun 12 '17 at 13:27
  • You are right, there are still some flaws, but it worked for a good part of the result. For example, it worked for `ã£` but not for `ã` – Rafael Jun 12 '17 at 13:37
  • That's what I expected. Your current attempt is not useful. I'd like to take the requests module out of the equation. Please use [`urllib.request.urlopen()`](https://stackoverflow.com/a/645318/18771) to download the data and add the raw result you get from there to your question. – Tomalak Jun 12 '17 at 13:42
  • The error is on the server side. I mean, there is more than one entry "Golden Cão Mb Adulto 3kg F" and one of those has the broken character. When I said that it was working in the browser, I was looking at the wrong entry. – Rafael Jun 12 '17 at 14:57
  • All right, I suspected it might be broken data. Not a lot you can do on the client side, then. The `requests` module does the right thing. – Tomalak Jun 12 '17 at 15:00
  • @Rafael Would you edit this into your question or add it as an (accepted) answer, so that others can find it? Currently it is buried in the comments. – bastelflp Nov 12 '17 at 11:32
  • I added it to the question. Thanks. – Rafael Nov 13 '17 at 12:18
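
To illustrate why the `'ignore'` workaround from the comments is lossy, here is a small sketch applied to the string quoted in the question (again assuming that string is exactly what `response.text` held). Bytes that cannot be decoded are silently dropped, so the damaged letter disappears instead of being repaired.

```python
# Text as returned by response.text in the question.
broken = "Golden C\u00e3\u00a3o Mb Adulto 3kg F"

# errors="ignore" discards the bytes that do not form valid UTF-8.
patched = broken.encode("latin-1").decode("utf-8", "ignore")

# ascii() keeps the output readable on any console encoding.
print(ascii(patched))
```

The exception goes away, but the `ã` goes with it, so the result still does not match the expected description.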

0 Answers