Remove special codes and convert to normal characters (`%u021`, `%u0219`)

Question

I have a variable with this value: 'Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania'

I need to remove these codes (%u021, %u0219, ...). I tried every tutorial I could find on the internet, without success.

How can I convert this string to normal characters?

I need this output:

'Strada Constitutiei, Focsani 620123, Romania'

@jonathan.scholbach How can I convert my string to normal characters? — Lucian Blaga, Oct 12 '21 at 22:33
Where do these strings come from? Are the `%u` sequences encoded? If so, what encoding are they using? I could write an answer assuming they're not encoded, but if they are, the code would choke on certain inputs. — wjandrea, Oct 12 '21 at 23:19
Related: [Python unicode codepoint to unicode character](/q/10715669/4518341), [What is the best way to remove accents (normalize) in a Python unicode string?](/q/517923/4518341), [Using f-strings with unicode escapes](/q/69380897/4518341) — wjandrea, Oct 12 '21 at 23:22
@Lucian By "normal characters", you mean unaccented, right? i.e. `'%u021B'` -> `'ț'` -> `'t'`? — wjandrea, Oct 12 '21 at 23:23

Jonathan Scholbach · Accepted Answer · 2021-10-12T23:34:20.300

The sequences we need to replace in your example are actually `%u021B` and `%u0219`. Looking these up, we find that they are almost Unicode escape sequences: the only difference is that they start with a percent sign instead of a backslash. If we had proper Unicode escape sequences, we could encode the string (transform it to bytes) and then decode it again with the "unicode-escape" codec.

So, to transform your input, we first replace all % signs with backslashes, and then apply this method:

```python
def custom_decode(string):
    return (
        string
        .replace("%", "\\")  # "\\" here is double as it needs to be escaped
        .encode()
        .decode("unicode-escape")
    )

custom_decode("Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania")
# "Strada Constituției, Focșani 620123, Romania"
```

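This restricts the input of `custom_decode` to strings that contain no standalone "%" character, i.e. a % sign that does not introduce an escape sequence. It also assumes the rest of the string is plain ASCII without backslashes: `.encode()` produces UTF-8 bytes, while "unicode-escape" reads bytes as Latin-1, so any non-ASCII character already present in the input would come out garbled.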
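One way around that restriction, offered as a sketch rather than a drop-in replacement: use a regular expression that matches only `%u` followed by exactly four hex digits, and leave every other `%` alone. The name `decode_percent_u` is made up for this example, and the code assumes each `%uXXXX` names a single code point (no UTF-16 surrogate pairs).

```python
import re

# Matches only "%u" followed by exactly four hexadecimal digits.
_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(text: str) -> str:
    """Replace each %uXXXX with the character at code point XXXX (hypothetical helper)."""
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


# A standalone "%" is left untouched, because "% o" does not match the pattern.
print(decode_percent_u("100% of Foc%u0219ani"))
```

Because nothing is passed through `.encode()`/`.decode()`, non-ASCII characters and backslashes already present in the input are left as they are.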

You might want to read about the encoding of strings in general, and in Python in particular, to get a better understanding of what is going on here.
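The decoded result still contains ț and ș, while the question asks for plain t and s. Assuming "normal characters" means unaccented ones (as asked in the comments on the question), a common approach is to decompose the string with `unicodedata.normalize("NFD", ...)` and drop the combining marks. `strip_accents` is a hypothetical helper name, and this is only a minimal sketch:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (hypothetical helper)."""
    # NFD splits e.g. U+021B into "t" followed by U+0326 (combining comma below).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters, non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Strada Constitu\u021biei, Foc\u0219ani 620123, Romania"))
# Strada Constitutiei, Focsani 620123, Romania
```

This only affects letters that have a canonical decomposition; characters such as "ø" or "ł" are single code points without one and pass through unchanged.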

Careful, if the `%u` sequences are encoded, this won't work. With JSON for example: `s = '\N{grinning face}'; j = json.dumps(s).replace('\\', '%'); e = json.loads(j)` -> `'%ud83d%ude00'`, then `custom_decode(e)` -> `'\ud83d\ude00'`. This is using UTF-16. — wjandrea, Oct 12 '21 at 23:36
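If the `%u` sequences really are UTF-16 code units, as in the JSON example from this comment, one possible fix is to decode each sequence to its code unit and then round-trip the string through UTF-16 so that surrogate pairs merge into a single character. This is a sketch under that assumption; `decode_percent_u_utf16` is a made-up name.

```python
import re


def decode_percent_u_utf16(text: str) -> str:
    """Decode %uXXXX sequences, treating them as UTF-16 code units (hypothetical helper)."""
    # Step 1: turn each %uXXXX into the code unit it names (possibly a lone surrogate).
    units = re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # Step 2: "surrogatepass" lets the surrogates through on encoding;
    # decoding as UTF-16 then combines each high/low pair into one character.
    return units.encode("utf-16", "surrogatepass").decode("utf-16")


print(decode_percent_u_utf16("%ud83d%ude00") == "\N{GRINNING FACE}")  # True
```

An unpaired surrogate in the input makes the final `.decode("utf-16")` raise `UnicodeDecodeError`, which is arguably better than silently returning a broken string.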
