
I have variables with values like this: 'Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania'

I need to remove codes like %u021 and %u0219. I have tried every tutorial I could find on the internet, without success.

How can I convert this string to normal characters?

I need this output:

'Strada Constitutiei, Focsani 620123, Romania'
wjandrea
Lucian Blaga
  • @jonathan.scholbach How can I convert my string to normal characters? – Lucian Blaga Oct 12 '21 at 22:33
  • Where do these strings come from? Are the `%u` sequences encoded? If so, what encoding are they using? I could write an answer assuming they're not encoded, but if they are, the code would choke on certain inputs. – wjandrea Oct 12 '21 at 23:19
  • Related: [Python unicode codepoint to unicode character](/q/10715669/4518341), [What is the best way to remove accents (normalize) in a Python unicode string?](/q/517923/4518341), [Using f-strings with unicode escapes](/q/69380897/4518341) – wjandrea Oct 12 '21 at 23:22
  • @Lucian By "normal characters", you mean unaccented, right? i.e. `'%u021B'` -> `'ț'` -> `'t'`? – wjandrea Oct 12 '21 at 23:23

1 Answer


The sequences we need to replace in your example are actually %u021B and %u0219. Looking these up, we find that they are almost Unicode escape sequences: the only difference is that they start with a percent sign instead of a backslash. If we had proper Unicode escape sequences, we could encode the string (transform it to bytes) and then decode it again with the "unicode-escape" codec.

So, to transform your input, we first replace every % sign with a backslash and then apply this method:

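def custom_decode(string):
    """Turn %uXXXX sequences into the characters they name."""
    return (
        string
        .replace("%", "\\")  # "\\" is doubled because the backslash itself must be escaped
        .encode()  # str -> bytes (UTF-8 by default)
        .decode("unicode-escape")  # interpret the \uXXXX escapes
    )

custom_decode("Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania")
# "Strada Constituției, Focșani 620123, Romania"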

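This limits the input of custom_decode to strings that do not contain a standalone "%" character, i.e. a % sign that does not introduce an escape sequence. It also assumes the rest of the string is ASCII: the "unicode-escape" codec reads the remaining bytes as Latin-1, so a character that is already accented in the input would come out garbled.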
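One way around the standalone-% restriction is to match only the %uXXXX pattern with a regular expression and leave every other character alone. This is a minimal sketch, not part of the method above; the name decode_percent_u is hypothetical, and it assumes each sequence is %u followed by exactly four hexadecimal digits.

```python
import re

# Matches %u followed by exactly four hexadecimal digits.
PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(text):
    """Replace each %uXXXX sequence with the character it names."""
    return PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_percent_u("Strada Constitu%u021Biei, Foc%u0219ani 620123, Romania"))
# Strada Constituției, Focșani 620123, Romania
print(decode_percent_u("100% of Foc%u0219ani, café"))
# 100% of Focșani, café
```

Because nothing is encoded to bytes and decoded again, a plain % sign and already-accented characters pass through unchanged.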

You might want to read about string encoding in general, and in Python in particular, to get a better understanding of what is going on here.
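The question asks for unaccented output ('Strada Constitutiei, Focsani 620123, Romania'), which needs one more step after decoding. A possible sketch, assuming that dropping combining marks after NFD normalization is acceptable for your data; strip_accents is a hypothetical helper name:

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Strada Constituției, Focșani 620123, Romania"))
# Strada Constitutiei, Focsani 620123, Romania
```

This only affects letters that decompose into a base letter plus a combining mark, as ț and ș do. Letters without such a decomposition (ø, for example) are left as they are.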

Jonathan Scholbach
  • Careful, if the `%u` sequences are encoded, this won't work. With JSON for example: `s = '\N{grinning face}'; j = json.dumps(s).replace('\\', '%'); e = json.loads(j)` -> `'%ud83d%ude00'`, then `custom_decode(e)` -> `'\ud83d\ude00'`. This is using UTF-16. – wjandrea Oct 12 '21 at 23:36
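If the input really does contain UTF-16 surrogate pairs as in the comment above, the decoded string can be repaired by round-tripping it through UTF-16. This is a sketch under that assumption; join_surrogates is a hypothetical name, and an unpaired surrogate would still raise a UnicodeDecodeError.

```python
def join_surrogates(text):
    """Merge UTF-16 surrogate pairs into the code points they encode."""
    # "surrogatepass" lets the surrogates through the encoder; decoding the
    # resulting UTF-16 bytes then combines each pair into one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


print(join_surrogates("\ud83d\ude00") == "\N{GRINNING FACE}")
# True
```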