I have a list of people's names with accents in them, but they have not been read properly in my Python code, so symbols appear in place of the accented characters.

['/Younghoon',
 '/Riley_Green',
 '/Peter_Popluh%C3%A1r',
 '/Finn_Wolfhard',
 '/Morteza_Pouraliganji',
 '/Cristi%C3%A1n_de_la_Fuente',
 '/Arturo_Carmona',
 '/Fabian_Arnold',
 '/Jack_Avery',
 '/Andres_Camilo',
 '/Utt%282%29',
 '/Eduardo_Barquin',
 '/Adri%C3%A1n_Lastra']

This should instead produce:

['/Younghoon',
 '/Riley_Green',
 '/Peter_Popluhár',
 '/Finn_Wolfhard',
 '/Morteza_Pouraliganji',
 '/Cristián_de_la_Fuente',
 '/Arturo_Carmona',
 '/Fabian_Arnold',
 '/Jack_Avery',
 '/Andres_Camilo',
 '/Utt%282%29',
 '/Eduardo_Barquin',
 '/Adrián_Lastra']

Here `/Utt%282%29` does not need converting, because these are links to pages on a website and this is exactly how the page name appears there.

Stackcans
  • OK. So what is the question? – balderman Aug 12 '21 at 12:05
  • Using [__url-decode__](https://www.urldecoder.io/) leads to `/Utt(2)` for `/Utt%282%29`. How should the code know that this needs to be excluded from decoding? – hc_dev Aug 12 '21 at 12:09
  • The data in your first example is data that has been read from a file encoded in UTF-8 but has been interpreted as though it was Windows-1252. If you specify `encoding='UTF-8'` when you open that file then you won't have to do any translation or conversion. That is a more straightforward solution than trying to undo the damage afterwards. – BoarGules Aug 12 '21 at 12:28
  • @hc_dev It seems that this conversion still works and directs to the correct HTML link. – Stackcans Aug 12 '21 at 12:31
  • @Stackcans That's because you can usually type any characters into the browser's address bar, or even paste URLs containing spaces. The [browser will translate = URL-encode](https://stackoverflow.com/questions/4530173/why-browsers-encode-url-in-this-form) them. Even command-line tools like _cURL_ or _wget_ do. – hc_dev Aug 12 '21 at 12:38

2 Answers

The %-encoded strings are probably URL-encoded (percent-encoded), as is usually done in URIs and URLs because they must be ASCII-only. For example, `%C3%A1` is the two-byte UTF-8 encoding of `á`.

To decode a URL-encoded string, use `urllib.parse.unquote(s)`.

See also: Url decode UTF-8 in Python

from urllib.parse import unquote

url_encoded_strings = ['/Younghoon',
    '/Riley_Green',
    '/Peter_Popluh%C3%A1r',
    '/Finn_Wolfhard',
    '/Morteza_Pouraliganji',
    '/Cristi%C3%A1n_de_la_Fuente',
    '/Arturo_Carmona',
    '/Fabian_Arnold',
    '/Jack_Avery',
    '/Andres_Camilo',
    '/Utt%282%29',
    '/Eduardo_Barquin',
    '/Adri%C3%A1n_Lastra']

decoded_strings = [unquote(s) for s in url_encoded_strings]

print('\n'.join(decoded_strings))

This gives the following output:

/Younghoon
/Riley_Green
/Peter_Popluhár
/Finn_Wolfhard
/Morteza_Pouraliganji
/Cristián_de_la_Fuente
/Arturo_Carmona
/Fabian_Arnold
/Jack_Avery
/Andres_Camilo
/Utt(2)
/Eduardo_Barquin
/Adrián_Lastra
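
Note that plain `unquote` also turns `/Utt%282%29` into `/Utt(2)`, which the question wanted left alone. One possible way around that, as a minimal sketch, is to decode only the escapes whose byte value is 0x80 or above (the bytes of non-ASCII UTF-8 characters) and leave ASCII escapes such as `%28` untouched. The helper name `unquote_non_ascii` is made up for this illustration, and the approach assumes the escapes are UTF-8:

```python
import re
from urllib.parse import unquote

# Runs of consecutive percent-escapes whose byte value is 0x80-0xFF,
# i.e. the bytes that make up non-ASCII characters in UTF-8.
NON_ASCII_ESCAPES = re.compile(r'(?:%[89A-Fa-f][0-9A-Fa-f])+')

def unquote_non_ascii(s):
    """Decode only non-ASCII percent-escapes; keep ASCII ones like %28 as-is.

    Hypothetical helper for illustration, not part of urllib.
    """
    return NON_ASCII_ESCAPES.sub(lambda m: unquote(m.group(0)), s)

links = ['/Peter_Popluh%C3%A1r', '/Utt%282%29', '/Younghoon']
print([unquote_non_ascii(s) for s in links])
```

With this, the accented names are decoded while `/Utt%282%29` stays exactly as it was. Whether that is the right rule depends on which escapes the target website expects to remain encoded.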
hc_dev

I had the same issue before. Could you pass `ensure_ascii=False` to `json.dumps` when you write them out? Example usage:

import json

with open("../example.json", "a", encoding="utf-8") as outfile:
    # ensure_ascii=False keeps characters such as á instead of \u00e1 escapes
    json_object = json.dumps(constJSON, indent=2, ensure_ascii=False)
    outfile.write(json_object)
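
To illustrate what that flag changes, here is a small sketch comparing the default behaviour of `json.dumps` with `ensure_ascii=False`. The sample list is an assumption made up for the demonstration. Note that this only affects JSON `\u` escapes; it does not decode percent-escapes such as `%C3%A1`.

```python
import json

# Sample data, assumed for illustration only.
names = ["/Adrián_Lastra"]

escaped = json.dumps(names)                       # default: ensure_ascii=True
readable = json.dumps(names, ensure_ascii=False)  # keep non-ASCII characters

print(escaped)   # ["/Adri\u00e1n_Lastra"]
print(readable)  # ["/Adrián_Lastra"]
```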