I've seen other people with the same error, but none of the solutions offered to them worked for me. I'm trying to read a file containing UTF-8 characters in Python, but the error "'utf-8' codec can't decode byte 0xe1 in position 2: unexpected end of data" is raised whenever an "á" or a similar character appears, even though I specified encoding="utf-8" when opening the file.

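# Use a name other than `input` so the built-in function is not shadowed.
filename = input("File: ")
# The with-block closes the file automatically after reading.
with open(filename, encoding="utf-8") as epa:
    print(epa.read())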

I had it working before, and I don't know what I did to make it stop working with this encoding. There was more code that wrote to the file (that part worked), but I have now deleted all of it to see whether the error remains, and it does.

pat
  • It means the file is not valid UTF-8. Without seeing the offending data, we can't tell you how to fix it, or what encoding to use instead. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Aug 13 '21 at 11:52
  • @tripleee but I'm specifically telling it to use the utf-8 encoding, and it used to work with the same file, which contains "opá" – pat Aug 15 '21 at 00:12
  • If that's really true, I can think of three explanations for that. *1.* The file's encoding changed (maybe somebody opened it in an editor and saved it with a different encoding), and so you are no longer actually reading exactly the same data. *2.* The file didn't contain any accented characters before, and so, effectively, now that it does, the file's encoding changed (because previously any ASCII-compatible encoding worked, but now you really have to specify the right one). ... – tripleee Aug 15 '21 at 07:32
  • ... *3.* Somebody wrecked the file so that *no* encoding is correct for all the data in it any longer. This could happen if somebody opened the file for appending, and supplied additional text in a different encoding, or simply added binary data which isn't text at all and for which the concept of text encoding isn't well-defined. – tripleee Aug 15 '21 at 07:32
  • We can help you identify the correct encoding, but only if you show us (ideally, just a small amount of) data in a representation where we can unambiguously understand what we are looking at. The error message from Python tells us that the problematic byte was 0xe1, which is a good but insufficient example. What are the other bytes around it and what is the expected result? The meta question I linked in my first comment explains this in more detail; see also the [`character-encoding` tag info page](http://stackoverflow.com/tags/character-encoding/info) and the guidance for providing a [mre]. – tripleee Aug 15 '21 at 07:36
  • @pat `'opá'.encode('utf8').hex(' ')` returns `'6f 70 c3 a1'`. `'opá'.encode('cp1252').hex(' ')` returns `'6f 70 e1'`. You may be *reading* the file in UTF-8, but you didn't *write* the file with UTF-8. – Mark Tolonen Aug 16 '21 at 16:45
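The point made in the last comment can be checked without the original file. The following is a minimal sketch, assuming the file really holds the three bytes 6f 70 e1 (which is what the error message suggests, but the actual file contents were never shown). The helper name try_decodings and the list of candidate encodings are hypothetical, chosen only for illustration.

```python
# Sketch only: `try_decodings` is a hypothetical helper, and cp1252 is just
# one plausible guess for the encoding the file was actually written with.

def try_decodings(data: bytes, candidates=("utf-8", "cp1252")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

utf8_bytes = "opá".encode("utf-8")      # "á" becomes the two bytes c3 a1
cp1252_bytes = "opá".encode("cp1252")   # "á" becomes the single byte e1

print(utf8_bytes.hex(" "))    # 6f 70 c3 a1
print(cp1252_bytes.hex(" "))  # 6f 70 e1

# Decoding the cp1252 bytes as UTF-8 reproduces the error from the question:
# 0xe1 announces a multi-byte sequence, but the data ends right after it.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Only cp1252 succeeds for these bytes; utf-8 maps to None.
decodings = try_decodings(cp1252_bytes)
```

To apply the same check to a real file, read it in binary mode, for example open(filename, "rb").read(), and pass the resulting bytes to the helper. A successful decode does not prove that an encoding is the right one (the UTF-8 bytes above also decode under cp1252, just to the wrong characters), so the result still has to be compared with the expected text.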

0 Answers