I have XML files with data in this format:

<DataBlock>
  <Text>Name1</Text>
  <Text>Name2</Text>
  <Text>Name3</Text>
</DataBlock>

The data can be in different languages, including Arabic, Chinese, Cyrillic, etc. It can also contain Latin text written with Gothic or handwriting-style Unicode characters, like this:

<Text>𝓐𝓑𝓑</Text>

or

<Text></Text>

But I need to save them as plain text, like this:

<Text>ABB</Text> 
<Text>ZERO</Text>

As I understand it, the problem is that the text is now stored as long Unicode code points:

[image: UTF codes]

How can I convert it to plain Latin letters?

Zoe
Vs Kc
  • This is not a font problem. You are using the Unicode characters U+1D4D0 (MATHEMATICAL BOLD SCRIPT CAPITAL A) and U+1D4D1 (MATHEMATICAL BOLD SCRIPT CAPITAL B). So this is a real problem that IMHO deserves a proper answer, but the question was misleading. – Serge Ballesta Apr 17 '20 at 09:30
  • `unicodedata.normalize('NFKD', "")` => `'ZERO'`. I disagree that this should be closed. It would be good to talk about how this works. You can normalize such text, which can be useful. – Todd Apr 17 '20 at 09:47
  • This is not an answer by itself but a hint: the mathematical alphanumeric symbols block is described [here](https://unicode.org/charts/PDF/U1D400.pdf) – Serge Ballesta Apr 17 '20 at 09:48
  • @Todd: your usage of NFKD here indeed deserves to be in an upvoted answer! – Serge Ballesta Apr 17 '20 at 09:50
  • `unicodedata.normalize('NFKD', "𝓐𝓑𝓑")` => `'ABB'`. See if that works for you, Vs Kc. – Todd Apr 17 '20 at 09:50
  • @Todd: Thank you, that's what I was looking for! – Vs Kc Apr 17 '20 at 10:51
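
For anyone landing here later, this is a minimal sketch of what the comments suggest, assuming Python 3 and only the standard library. The helper names `to_plain` and `normalize_texts` are made up for illustration, and the sample string is a hypothetical stand-in for the real file contents. It relies on the fact that the mathematical script letters carry a compatibility mapping to plain letters, which the NFKD (and NFKC) normalization forms apply.

```python
import unicodedata
import xml.etree.ElementTree as ET


def to_plain(text, form="NFKD"):
    """Apply Unicode compatibility normalization to a string.

    Characters such as U+1D4D0 (MATHEMATICAL BOLD SCRIPT CAPITAL A)
    are replaced by their plain counterparts ('A').
    """
    return unicodedata.normalize(form, text)


def normalize_texts(xml_string, form="NFKD"):
    """Normalize the text of every <Text> element (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    for elem in root.iter("Text"):
        if elem.text:  # skip empty elements
            elem.text = to_plain(elem.text, form)
    # encoding="unicode" returns a str and leaves non-ASCII text unescaped
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: script letters (written as escapes) plus Cyrillic text
sample = (
    "<DataBlock>"
    "<Text>\U0001D4D0\U0001D4D1\U0001D4D1</Text>"
    "<Text>Имя</Text>"
    "</DataBlock>"
)
print(normalize_texts(sample))
```

One thing to watch for: NFKD also splits accented letters into a base letter plus combining marks and folds other compatibility characters (ligatures, full-width forms), so it may change more than the fancy Latin letters. Passing `form="NFKC"` keeps accented letters composed, which is probably the safer choice for mixed-language data.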
