How to convert API sourced String Unicode to UTF-8

Question

I'm having trouble converting the data in the following list into UTF-8. I have to convert it in order to insert it into a SQL table using pyodbc:

cod_cat = ['Transfer\\u00eancia', 'Entrada de Transfer\\u00eancia', 'Sa\\u00edda de Transfer\\u00eancia']

The same data when I request it in Postman:

cod_cat = ['Transferência', 'Entrada de Transferência','Saída de Transferência']

This list comes from an API request using the following code:

res = requests.post(url, json=teste)
resposta = res.text

I use a function that finds specific delimiters in the response and collects the string between those delimiters every time they show up.

I've tried to:

for i in lista_cat:
     i.decode('utf-8', 'ignore')

and, before the library imports:

# -*- coding: utf-8 -*

I also tried calling encode() and then decoding again. None of the three approaches gave me proper results. I've tried to find out which encoding "res" is giving me, but as I found in other questions:

Correctly detecting the encoding all times is impossible.

My guess is that .text isn't the best way to get data containing these special characters, but I don't know any other way around that.

Here is a sample of what res.text looks like:

"definida_pelo_usuario":"N","descricao":"Transfer\u00eancia","descricao_padrao":"Transfer\u00eancia"

The function I used in order to make the list:

resposta = res.text

def achar_str(inicio, delimitante):
    """Collect every substring of resposta found between inicio and delimitante."""
    lista = []
    idx = 0
    while True:
        try:
            # find the next start marker
            found_at = resposta.index(inicio, idx)
            comeco = found_at + len(inicio)
            # find the delimiter that closes it
            achar = resposta.index(delimitante, comeco)
        except ValueError:
            # no more matches
            return lista
        # append the string between the delimiters
        lista.append(resposta[comeco:achar])
        # continue searching after the closing delimiter
        idx = achar + len(delimitante)

When you write `res.text` to a file (make sure you use `open('testfile.txt', 'w', encoding='utf8')` to open the file), what is the content of that file? — Tomalak, Jan 12 '21 at 16:30
@snakecharmerb `res.json` does work, however I have a lot of code built around the response being a string, not a dict. Do you think that's the only way to solve it? Converting the JSON into a str would result in the same error, wouldn't it? — thiagopleasehelp, Jan 12 '21 at 16:34
@Tomalak I'm not currently writing a file as it is not necessary for my code, I simply store the string in a variable named `resposta` — thiagopleasehelp, Jan 12 '21 at 16:35
@thiagopleasehelp I know. I want to verify that the data is not already broken at the source, and I want to get a complete sample of what the API actually delivers. — Tomalak, Jan 12 '21 at 16:37
@Tomalak I'm sorry, I didn't understand. When I do write a file I get the same result: `"descricao":"Transfer\u00eancia","descricao_padrao":"Transfer\u00eancia"` Thanks for the patience so far! — thiagopleasehelp, Jan 12 '21 at 16:41
It's difficult to be sure without seeing how you are constructing your list, but I would suspect that the values in the JSON are correct, but your code is corrupting them somehow: for example in the json you have `"Transfer\u00eancia"` (`print("Transfer\u00eancia")` -> `Transferência`) but in your list `Transfer\\u00eancia` (note the doubled backslash). — snakecharmerb, Jan 12 '21 at 16:43
@thiagopleasehelp Then the data is already broken when you get it. That's what I had suspected. What happens if you request this data with a browser (or a command line tool like wget or curl)? Can you share the response headers? — Tomalak, Jan 12 '21 at 16:45
I'll add the list making function into my question. You are right, the JSON values work absolutely fine, however when I use `.text` every special character turns into `\u00` even before I put them into a list — thiagopleasehelp, Jan 12 '21 at 16:47
@Tomalak when I use my browser or some backend testing software such as Postman, I get the characters absolutely fine: `Transfer\u00eancia` is `Transferência` as expected — thiagopleasehelp, Jan 12 '21 at 16:53
You are confused. There is no UTF-8 here. `\u00ea` is the JSON representation of `ê`. If you are receiving JSON, decode it as JSON; your code and your question are however too unclear to decide if this is really the case. — tripleee, Jan 12 '21 at 16:57
@tripleee you are correct, I do receive JSON. I guess I'll have to rewrite my code. Thanks. Also, any tips on how to be clearer both in my question and my code? — thiagopleasehelp, Jan 12 '21 at 17:07
`[x.encode('raw_unicode_escape').decode('unicode_escape') for x in cod_cat]` returns `['Transferência', 'Entrada de Transferência', 'Saída de Transferência']` — JosefZ, Jan 12 '21 at 17:13
@JosefZ this works!!!! Would it be better in future code to get the JSON itself instead of the `.text`? — thiagopleasehelp, Jan 12 '21 at 17:23
Read [this thread](https://stackoverflow.com/q/62821643/3439404). — JosefZ, Jan 12 '21 at 17:57
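
To illustrate the comments above, here is a minimal sketch, assuming the API body is JSON shaped like the sample in the question. The `raw` string is a hypothetical stand-in for `res.text`, and the variable names are made up for the example. It shows that parsing the text as JSON decodes `\u00ea` into `ê`, that slicing the raw text leaves the escape as six literal characters, and that the one-liner from JosefZ's comment repairs such a slice.

```python
import json

# Hypothetical stand-in for res.text, built from the sample in the question.
# The raw string literal keeps the backslashes as they arrive in the response body.
raw = r'{"definida_pelo_usuario":"N","descricao":"Transfer\u00eancia","descricao_padrao":"Transfer\u00eancia"}'

# Parsing the text as JSON decodes the \uXXXX escapes into real characters.
dados = json.loads(raw)
descricao = dados["descricao"]
print(descricao)

# Slicing the raw text instead keeps the escape sequence as literal characters.
inicio = '"descricao":"'
comeco = raw.index(inicio) + len(inicio)
fatiado = raw[comeco:raw.index('"', comeco)]
print(fatiado)

# Workaround from JosefZ's comment: turn the literal escape back into a character.
reparado = fatiado.encode('raw_unicode_escape').decode('unicode_escape')
print(reparado)
```

With requests, `res.json()` should perform the same parsing as `json.loads(res.text)`, so the resulting values are ordinary Python strings that can be passed on as they are. The workaround is probably only worth keeping where the string-slicing code cannot be rewritten.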
