The sequences we need to replace in your example are actually %u021B
and %u0219
. Googling these, we find that they are "almost" unicode-escaped sequences. The only difference is, they do not start with a backslash, but with a percentage sign. If we had the proper unicode sequences, we could encode it (transform it to bytes) and then decode it again with the encoding "unicode-escape".
So, to transform your input, we replace all %
signs first, and then apply this method:
def custom_decode(string):
return (
string
.replace("%", "\\") # "\\" here is double as it needs to be escaped
.encode()
.decode("unicode-escape")
)
custom_decode("Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania")
# "Strada Constituției, Focșani 620123, Romania"
This limits potential input of our custom_decode
method to strings which do not have a standalone "%"
character, i.e. a % sign that is not indicating an escape sequence.
You might want to read about encoding of strings in general, and in python in particular to get a better understanding of what is going on here.