4

We hav some text containing german umlauts represented using e.g. 'a' + COMBINING DIAERESIS ($cc $88).

Any idea how to convert such text properly to utf8?

  • So you have two code points, or you have some encoded form? Output the character to a UTF-8 stream, and it will be encoded. But see Ignacio’s answer about normalization. You probably want to normalize to NFC. – tchrist Apr 21 '11 at 18:10

1 Answers1

5

First, if it's not already a unicode then decode it. Second, unicodedata.normalize(). Third, encode.

Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
  • Specifically, you would want it in NFC form if you’re outputting it. NFD is more often used internally. – tchrist Apr 21 '11 at 18:10