Problem formulation/example

Consider the Latin character á, which can be written in a string literal as any of the following escape sequences:

  • \xe1 in hex

  • \u00e1 in 16-bit hex

  • \U000000e1 in 32-bit hex

In the following code block, I decompose the Latin-1 character and keep only its base letter, which removes the accent (i.e. á becomes a):

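import unicodedata

# NFD splits á into the base letter 'a' followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize('NFD', '\xe1')
# Encoded as UTF-8 this is b'a\xcc\x81'; the first byte is the ASCII code for 'a'.
encoded = decomposed.encode("utf-8")
letter = chr(list(encoded)[0])

print(letter)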

(Any of the three bullet-pointed formats could have been used in the second argument of unicodedata.normalize().)
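As a quick check of that claim, the sketch below compares the three escape spellings. It assumes nothing beyond the standard library; the variable names `forms`, `all_same` and `lengths` are only illustrative.

```python
# The three escapes are different ways of typing the same
# one-character string in source code.
forms = ['\xe1', '\u00e1', '\U000000e1']

# chr(0xE1) builds the character from its code point, with no escape involved.
all_same = all(form == chr(0xE1) for form in forms)
lengths = [len(form) for form in forms]

print(all_same)
print(lengths)
```

Each literal is a single character, so whichever spelling is used, normalize() receives the same string.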

My issue

I want to generalise this so that the second argument to normalize() is a variable rather than a hard-coded string literal.

I'm struggling to do this without typing the string directly into the function call, because of the escaped backslash.

Example attempt

latin = "á"
a = ascii(latin) # print(a) gives '\xe1'
decomposed = unicodedata.normalize('NFD', a)  
encoded = decomposed.encode("utf-8")
letter = chr(list(encoded)[0]) 

This won't work because the argument a is interpreted as '\\xe1' instead of '\xe1'.

Other attempts to get the hex representation and construct a string by concatenating \x to it won't work either, for the same reason.
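To see what the attempt above actually passes to normalize(), here is a small inspection sketch. It only uses built-ins; the names `latin` and `a` mirror the attempt, and `chars` is an illustrative extra.

```python
latin = chr(0xE1)      # the single character á
a = ascii(latin)       # a printable description of that character

chars = list(a)        # the individual characters ascii() produced

print(len(latin))
print(len(a))
print(chars)
print(a == latin)
```

As far as I can tell, ascii() returns the quote characters, a real backslash and the letters x, e, 1 as separate characters, so `a` is a different, longer string than `latin`, not another spelling of it.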

quanty
  • The backslash escape isn't in the string. It's just how the character is represented when you type a literal in the program, and how the `repr()` function shows it. So there's nothing to remove. – Barmar May 25 '22 at 23:19
  • @Barmar ye that makes sense because if I do a[0] I would just get ‘\’. I’m struggling to see why the function interprets the two differently when the backslash escape isn’t actually in the string. – quanty May 26 '22 at 06:23
  • `'\xe1' == 'á'`. It's just a way to type a single character with an escape code. `ascii('á')` generates the 4-character string `'\\xe1'` where `'\\'` is an escape code for a single, literal backslash. – Mark Tolonen May 26 '22 at 06:24
  • Thanks @MarkTolonen makes sense. So it comes down to the fact that `ascii('á') != '\xe1'`, so it's simply an invalid representation of `'á'` for use in this function. Any idea on how I can obtain the hex (or preferably 32-bit hex) representation of arbitrary unicode characters that can be used in this function? – quanty May 26 '22 at 08:44
  • Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) – JosefZ May 26 '22 at 10:00
  • @JosefZ I'm trying to solve that exact problem myself so I'm reluctant to visit that link... It's just an educational exercise for me – quanty May 26 '22 at 10:41
  • The function takes a character. Just pass the character. You don’t need the hex code. If you have the hex code, `chr()` will return the character – Mark Tolonen May 26 '22 at 13:18
  • So [this my answer](https://stackoverflow.com/a/69650346/3439404) is what you need… We all are learning from existing common solutions rather than [reinventing the wheel](https://en.wikipedia.org/wiki/Reinventing_the_wheel)… – JosefZ May 26 '22 at 14:45
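The suggestion in the comments (pass the character itself, and use `chr()` if only a hex code is available) can be sketched roughly as follows. This is a minimal sketch, not a full accent-stripping solution: `strip_first_accent` is a hypothetical helper name, and it assumes the input is a single character whose NFD form starts with the base letter.

```python
import unicodedata


def strip_first_accent(char):
    """Hypothetical helper: return the base letter of one accented character."""
    # NFD puts the base letter first, followed by any combining marks.
    decomposed = unicodedata.normalize('NFD', char)
    return decomposed[0]


latin = chr(0xE1)                  # á held in an ordinary variable
letter = strip_first_accent(latin)

# If only the hex digits are known, chr() rebuilds the character from them.
from_hex = chr(int('e1', 16))

# The 32-bit hex form of the code point, if it is wanted for display.
code_point_hex = format(ord(latin), '08x')

print(letter)
print(from_hex == latin)
print(code_point_hex)
```

Indexing the decomposed string avoids the round trip through UTF-8 bytes, which would only work when the base letter is ASCII.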

0 Answers