2

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, Ã¤ appears in place of ä, Ãœ in place of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and in the Python DataFrame as such, instead of the proper characters ä, ö, ü, ß.

One workaround to solve this issue is:

  1. Open the file in standard Notepad.
  2. Choose 'Save As'; a dialog window appears.
  3. In the encoding drop-down, change the encoding to UTF-8.

Now, when you import the file into SAS or Python, everything is read correctly.

But sometimes the .txt files I have are very big (several GB), so I cannot open them in Notepad to apply this fix.

I could use the .replace() function to substitute these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of, which is why I would like to avoid that approach.

Is there any Python library which can automatically translate these strange characters into their proper ones, e.g. Ã¤ gets translated to ä, and so on?

cph_sto
  • 7,189
  • 12
  • 42
  • 78

2 Answers

2

Did you try the codecs library?

import codecs
# open the file for reading with an explicit encoding, e.g. 'utf-8'
your_file = codecs.open('your_file.extension', 'r', encoding='encoding_type')
S.C.A
  • 87
  • 9
  • Thanks for your message. Sorry, it did not help. I tried it exactly in this manner: first I read the file and then wrote it back out, as shown here https://stackoverflow.com/questions/19591458/python-reading-from-a-file-and-saving-to-utf-8 , but the problem remains. – cph_sto Aug 27 '18 at 14:12
  • are you using python 2 or 3? – S.C.A Aug 28 '18 at 14:21
  • Hi, I am using Python 3.+, not the legacy Python 2.+. – cph_sto Aug 29 '18 at 06:32
0

If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.

with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
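If you are unsure which encoding a file uses, opening it with the wrong one typically either raises a UnicodeDecodeError or silently produces mojibake. A minimal sketch with an in-memory byte string (the file names and data here are illustrative, not from the question):

```python
# The UTF-8 encoding of 'ä' is two bytes; decoding them with the wrong
# codec either fails or produces mojibake.
data = 'ä'.encode('utf-8')      # b'\xc3\xa4'

try:
    data.decode('ascii')        # wrong codec: raises UnicodeDecodeError
except UnicodeDecodeError:
    print('not ASCII')

print(data.decode('utf-8'))     # correct codec: ä
print(data.decode('latin-1'))   # wrong codec, no error: Ã¤ (mojibake)
```

Note that Latin-1 never raises on decoding, because every byte value is a valid Latin-1 character; that is exactly why this kind of mix-up goes unnoticed.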

If the file actually contains mojibake, there is no simple way in the general case to revert every possible way of screwing up text. But a common mistake is to assume the text was Latin-1 and convert it to UTF-8 when the input was in fact already UTF-8. What you can do then is request Latin-1 when reading, and make sure you save the result in the correct format as soon as you have read it.

with open(filename, 'r', encoding='latin-1') as inp, \
     open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
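The same reversal can be sketched in memory: encoding the mojibake string back to the raw bytes it came from (Latin-1) and decoding those bytes as UTF-8 recovers the original text. This only works for this particular mix-up, not for mojibake in general:

```python
# 'äöü' read through a Latin-1 decoder when the bytes were really UTF-8
mojibake = 'Ã¤Ã¶Ã¼'

# reverse the mix-up: back to the raw bytes, then decode them as UTF-8
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)  # äöü
```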

The ftfy library claims to be able to identify and correct a number of common mojibake problems.

tripleee
  • 175,061
  • 34
  • 275
  • 318