2

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, Ã¤ appears in place of ä, Ãœ in place of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and in the Python DataFrame as such, instead of the proper characters ä, ö, ü, ß.

One workaround to solve this issue is:

  1. Open the file in standard Notepad.
  2. Choose 'Save As'; a dialog window appears.
  3. In the encoding drop-down, change the encoding to UTF-8.

Now, when you import the file into SAS or Python, everything is read correctly.

But sometimes the .txt files I have are very big (several GB), so I cannot open them in Notepad to apply this fix.

I could use the .replace() function to substitute these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of, which is why I would like to avoid that approach.

Is there any Python library which can automatically translate these strange characters into their proper ones, e.g. Ã¤ gets translated to ä, and so on?

cph_sto
  • 7,189
  • 12
  • 42
  • 78

2 Answers

2

Did you try the codecs library?

import codecs
# open the file for reading with an explicit encoding, e.g. 'utf-8'
your_file = codecs.open('your_file.extension', 'r', encoding='encoding_type')
S.C.A
  • 87
  • 9
  • Thanks for your message. Sorry, it did not help. I tried it exactly in this manner: first I read the file and then wrote it back out, as shown here https://stackoverflow.com/questions/19591458/python-reading-from-a-file-and-saving-to-utf-8 , but the problem remains. – cph_sto Aug 27 '18 at 14:12
  • are you using python 2 or 3? – S.C.A Aug 28 '18 at 14:21
  • Hi, I am using Python 3.+, not the legacy Python 2.+. – cph_sto Aug 29 '18 at 06:32
0

If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.

with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
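If you are unsure which encoding a file uses, opening it with the wrong one typically either raises a UnicodeDecodeError or silently produces mojibake. A minimal sketch with an in-memory byte string (the file names and data here are illustrative, not from the question):

```python
# The UTF-8 encoding of 'ä' is two bytes; decoding them with the wrong
# codec either fails or produces mojibake.
data = 'ä'.encode('utf-8')      # b'\xc3\xa4'

try:
    data.decode('ascii')        # wrong codec: raises UnicodeDecodeError
except UnicodeDecodeError:
    print('not ASCII')

print(data.decode('utf-8'))     # correct codec: ä
print(data.decode('latin-1'))   # wrong codec, no error: Ã¤ (mojibake)
```

Note that Latin-1 never raises on decoding, because every byte value is a valid Latin-1 character; that is exactly why this kind of mix-up goes unnoticed.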

If the file actually contains mojibake, there is no simple way in the general case to revert every possible way of screwing up text. But a common mistake is to assume the text was Latin-1 and convert it to UTF-8 when the input was in fact already UTF-8. What you can do then is request Latin-1 when reading, and make sure you save the result in the correct format as soon as you have read it.

with open(filename, 'r', encoding='latin-1') as inp, \
     open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
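The same reversal can be sketched in memory: encoding the mojibake string back to the raw bytes it came from (Latin-1) and decoding those bytes as UTF-8 recovers the original text. This only works for this particular mix-up, not for mojibake in general:

```python
# 'äöü' read through a Latin-1 decoder when the bytes were really UTF-8
mojibake = 'Ã¤Ã¶Ã¼'

# reverse the mix-up: back to the raw bytes, then decode them as UTF-8
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)  # äöü
```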

The ftfy library claims to be able to identify and correct a number of common mojibake problems.

tripleee
  • 175,061
  • 34
  • 275
  • 318