0

I have a text file in an unknown encoding that contains some German characters (umlauts). I want to open this file with Python and read it as UTF-8. However, everything I have tried raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1664: invalid continuation byte

What I tried so far:

open(filepath, "rb").read().decode("utf-8")

I also tried:

open(filepath, "r", encoding="utf-8")

I know that I could, for instance, open the file in a text editor such as Notepad and choose the encoding under "Save as". After saving it as UTF-8 I can of course process it with Python just by calling open(filepath). But how can I achieve the same effect using only Python (without the text editor step)? I assume I could somehow make the decoder work by suppressing errors, but I don't know how...

EDIT: Is there a "general approach" to this problem? Many of the comments suggest that this file was encoded on a Windows machine, so I could "guess" the encoding beforehand. However, how should I approach this problem if, say, I am developing software and the user just provides a text file as input? I don't want to just output an error stating that the encoding is wrong. Is there a way to transform any encoding into UTF-8?

teoML
  • Obligatory background reading: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Brian61354270 Mar 22 '23 at 23:25
  • Where is the text file coming from? `0xE4` is the [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) (`cp1252`) encoding for `ä`, so there's a good chance that it's Windows-1252 encoded. – Brian61354270 Mar 22 '23 at 23:27
  • Does this answer your question? [How to determine the encoding of text](/q/436220/4518341) – wjandrea Mar 22 '23 at 23:36

2 Answers

0

0xE4 is the Windows-1252 encoding for ä (lower case 'a' with an umlaut), so it looks like your file is Windows-1252 encoded.

To read a Windows-1252 encoded file, you can use the encoding name cp1252:

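# encoding must be passed by keyword; the third positional argument of open() is buffering
open(filepath, "r", encoding="cp1252")
# or
open(filepath, "rb").read().decode("cp1252")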
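
To get the same effect as the editor's "Save as" step using only Python, one option is to decode with the source encoding and write the text back out as UTF-8. This is a minimal sketch that assumes the file really is Windows-1252; the helper name `convert_to_utf8` and its parameters are hypothetical, not part of any library:

```python
def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8 (hypothetical helper).

    Assumes the input is encoded as `source_encoding`; if that guess is
    wrong, decoding may fail or silently produce wrong characters.
    """
    # newline="" disables newline translation, so line endings are preserved.
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text
```

Calling `convert_to_utf8("input.txt", "output.txt")` (placeholder file names) would then leave a UTF-8 copy that a plain `open(..., encoding="utf-8")` can read.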
Brian61354270
  • Is there a way to decode if the input encoding is unknown? – teoML Mar 24 '23 at 13:25
  • @teoML Not really. See the question linked by wjandrea in the question comments. – Brian61354270 Mar 24 '23 at 14:18
  • @teoML That said, Windows-1252 is arguably the most common encoding after UTF-8. It was Microsoft's default for decades, and in many cases still is. If the file was written by a Windows machine configured for a European language, it's almost certainly that. – Brian61354270 Mar 24 '23 at 14:20
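
Building on the comments above, one common heuristic when the encoding is unknown is to try UTF-8 first (it is strict, so text in another encoding rarely decodes cleanly) and then fall back to Windows-1252. This is only a sketch of that guess-in-order idea, not real encoding detection; the function name `decode_with_fallback` and the candidate list are assumptions chosen for illustration:

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper: the result is a best guess, and a wrong guess
    can still yield garbled characters without raising an error.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return data.decode("latin-1"), "latin-1"
```

For a file, this could be used as `text, used = decode_with_fallback(open(filepath, "rb").read())`. If losing information is acceptable, `data.decode("utf-8", errors="replace")` is the built-in way to suppress decoding errors; it substitutes U+FFFD for every undecodable byte.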
-2

UTF-8 uses one to four bytes to encode a code point, depending on the code point's value. See Josh Lee's answer to "UnicodeDecodeError, invalid continuation byte", from which this example is taken:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
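
To connect this to the error in the question, here is a small illustrative sketch (the helper `utf8_error_reason` is hypothetical). It shows that `ä` is the single byte 0xE4 in Windows-1252 but two bytes in UTF-8, and that 0xE4 read as UTF-8 is the lead byte of a three-byte sequence, so an ordinary letter following it is rejected:

```python
# 'ä' is one byte in Windows-1252 but two bytes in UTF-8.
cp1252_bytes = "ä".encode("cp1252")
utf8_bytes = "ä".encode("utf-8")


def utf8_error_reason(data):
    """Return the decoder's error reason, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# "Bär" as written by a Windows-1252 editor: 0xE4 is followed by 'r',
# which is not a valid UTF-8 continuation byte.
reason = utf8_error_reason(b"B\xe4r")
print(cp1252_bytes, utf8_bytes, reason)
```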

If you want to read a text file, you can use this:

open('sample-file.txt', mode='r', encoding='utf-8').read()

If you want to write to a text file, you can use this:

open('a-new-file.txt', mode='w', encoding='utf-8')

Example:

open('questions.txt', mode='w', encoding='utf-8').write('How to read a text file with unknown format and save it as utf-8?')

You can also open a file with any of these three encodings if you need to:

sample_text_default = open('questions.txt', encoding='utf-8').read()
print(sample_text_default)
sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)
sample_text_ascii = open('sample-character-encoding.txt', encoding='ascii').read()
print(sample_text_ascii)

Reference: Melanie Walsh

  • The OP already knows how to open a UTF-8 encoded file. The problem is that that doesn't work because the file is not UTF-8 encoded. – Brian61354270 Mar 23 '23 at 15:18