I have a text file in an unknown encoding that contains some German characters (umlauts). I want to open this file with Python and read it as "utf-8". However, everything I have tried raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1664: invalid continuation byte
What I tried so far:
open(filepath, "rb").read().decode("utf-8")
I also tried:
open(filepath, "r", encoding="utf-8").read()
I know that I could, for instance, open the file in a text editor such as Notepad, click "Save As", and choose the file's encoding there. After saving it as UTF-8 I can of course process it with Python just by calling open(filepath).
But how can I achieve the same effect using only Python (without the text editor step)?
I assume that I could somehow make the decoder work by suppressing errors, but I don't know how...
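For illustration, Python's decoders take an errors argument ("strict" is the default; "ignore" and "replace" are the lenient handlers). The sketch below is only a guess at what suppressing errors would look like: it uses a made-up byte string containing the 0xe4 byte from the traceback. Neither lenient handler brings the umlaut back; they only stop the exception.

```python
# Hypothetical sample: "Käse" as a Windows-1252/Latin-1 byte string,
# where the umlaut ä is the single byte 0xe4.
raw = b"K\xe4se"

# Default behaviour ("strict"): the lone 0xe4 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict failed:", exc.reason)

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD (the replacement character) for it.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored), repr(replaced))
```

The same errors argument is accepted by open(), but in both cases the original character is lost, so this hides the problem rather than solving it.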
EDIT: Is there a "general approach" to this problem? Many of the comments suggest that this file was encoded on a Windows machine, so I could "guess" the encoding beforehand. However, how should I approach this problem if, say, I am developing software and the user just provides a text file as input? I don't want to simply output an error stating that the encoding is wrong. Is there a way to transform any encoding into UTF-8?
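One possible shape for such a "general approach" is a fallback chain: read the raw bytes, try a short list of likely encodings in order, and write the result back out as UTF-8. This is a minimal sketch under stated assumptions, not a definitive solution. The candidate list (UTF-8 first, then Windows-1252, based on the comments) and the function names read_text_guessing and convert_to_utf8 are hypothetical. A file's bytes do not record their encoding, so any such scheme can guess wrong.

```python
from pathlib import Path

# Assumption: encodings worth trying, strictest first. "utf-8-sig" is
# UTF-8 that also strips a leading byte order mark if one is present.
CANDIDATES = ("utf-8-sig", "cp1252")


def read_text_guessing(path, candidates=CANDIDATES, last_resort="latin-1"):
    """Return (text, encoding_used) for the file at path."""
    data = Path(path).read_bytes()
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # latin-1 maps every possible byte to a character, so this cannot
    # raise, but the resulting text may not be what the author wrote.
    return data.decode(last_resort), last_resort


def convert_to_utf8(src, dst):
    """Re-encode src as UTF-8 into dst; return the encoding that was assumed."""
    text, enc = read_text_guessing(src)
    # Write bytes rather than text so line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
    return enc
```

Calling convert_to_utf8("input.txt", "output.txt") (both paths are placeholders) would mirror the "Save As UTF-8" step from the editor. Third-party detectors such as chardet or charset-normalizer guess the encoding statistically instead of from a fixed list, but their result is likewise only a guess.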