I have two strings, a keyword and a text, and I want to check whether the keyword occurs in the text. With umlauts, in this case 'ü', the check fails. If I encode the strings to UTF-8 with Python's encode() method, the 'ü' comes out as u\xcc\x88 in one string and as \xc3\xbc in the other, so the two strings seem to represent the character differently. This is not noticeable in print(), where both are displayed normally, e.g. "Münze" and "Münze", but the count() method I use to find the keyword returns no hits because of the difference. How can I correct this and make both strings consistent?

Fred
    My immediate thought is that you have Unicode text and some of it isn't normalized. https://stackoverflow.com/questions/16467479/normalizing-unicode – Mikael Öhman Jul 21 '23 at 09:58

1 Answer

The problem can be solved by normalizing both strings to the same Unicode normalization form. For this you can use the normalize() function from Python's standard-library module unicodedata.

import unicodedata
string = unicodedata.normalize('NFC', string)
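
To make the effect concrete, here is a minimal sketch that reproduces the mismatch and applies the normalization to both strings before counting. The sample strings and the helper name count_normalized are made up for illustration and are not from my actual data; the only assumption is that one string holds the precomposed 'ü' (U+00FC) and the other holds 'u' followed by a combining diaeresis (U+0308), which matches the byte sequences from the question.

```python
import unicodedata

# Hypothetical sample data: the same word in two Unicode representations.
keyword = "M\u00fcnze"      # precomposed: 'ü' is the single code point U+00FC
text = "Eine Mu\u0308nze"   # decomposed: 'u' followed by combining diaeresis U+0308

print(keyword.encode("utf-8"))  # b'M\xc3\xbcnze'
print(text.encode("utf-8"))     # b'Eine Mu\xcc\x88nze'
print(text.count(keyword))      # 0, the code point sequences differ


def count_normalized(text, keyword, form="NFC"):
    """Count occurrences after normalizing both strings to the same form.

    Hypothetical helper for illustration only.
    """
    normalized_text = unicodedata.normalize(form, text)
    normalized_keyword = unicodedata.normalize(form, keyword)
    return normalized_text.count(normalized_keyword)


print(count_normalized(text, keyword))  # 1
```

What matters is that both strings go through the same form: 'NFC' composes to the single code point, 'NFD' decomposes to base letter plus combining mark, and either one makes the comparison consistent.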

Credits to Mikael Öhman for the idea.

Fred