Encoding text with multiple encodings

Question

I am trying to open a .txt file in Python and read it using open() and read(). The problem is that some of the text is not UTF-8. Here is the error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1911885: character maps to <undefined>

How can I read this document?

Possible duplicate of [UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) – snakecharmerb, May 10 '19 at 12:49
Show a small working piece of code that demonstrates the problem. It looks like you haven't opened the file for reading as UTF-8. — Mark Tolonen, May 10 '19 at 16:52

Atul Gopinathan · Answer 1 · 2019-05-10T11:44:59.283

0

You might want to check the answers to this question, as it seems very similar to yours: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

As suggested there, try:

file = open(filename, encoding="utf8")
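
For illustration, here is a minimal sketch of why the explicit encoding matters. It assumes the file really is UTF-8 and that the error came from a cp1252 default; the sample text, the temporary file and the read_utf8 helper are made up for the demo and are not from the question.

```python
import os
import tempfile


def read_utf8(path):
    """Read a whole text file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Hypothetical sample: U+010F encodes to the bytes C4 8F in UTF-8,
# so the file contains the byte 0x8f from the error message.
sample = "d\u010f"

fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Decoding the same bytes as cp1252 fails, because 0x8f is undefined there.
try:
    with open(demo_path, encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 round-trips the original text.
text = read_utf8(demo_path)
os.remove(demo_path)
```

If the file still fails to decode with encoding="utf-8", it is probably not (entirely) UTF-8 after all.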

I was planning to share this as a comment, but I don't have enough reputation for that :)

EDIT: Expanded after reading your comment on my original answer, and as suggested by Cettt:

Probably the best way to deal with decoding errors is to use the errors argument of open(). As you said in your question, only some characters fail to decode, so this should be fine to use:

file = open(filename, encoding="utf8", errors="ignore")

NOTE: with errors="ignore", Python silently drops every byte it cannot decode. So I would recommend this only if you are fine with losing some data.
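
To make the data loss concrete, here is a small sketch comparing three standard error handlers on a made-up byte string (the sample bytes are an assumption, not taken from the question's file). The same errors values can be passed to open().

```python
# Hypothetical input: valid UTF-8 ("caf\u00e9") followed by a stray 0x8f byte.
raw = b"caf\xc3\xa9 \x8f end"

# "ignore" drops the undecodable byte without leaving any trace.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD, so the damage stays visible in the text.
replaced = raw.decode("utf-8", errors="replace")

# "backslashreplace" keeps the original byte value as an escape sequence.
escaped = raw.decode("utf-8", errors="backslashreplace")

for label, value in [("ignore", ignored), ("replace", replaced), ("backslashreplace", escaped)]:
    print(label, repr(value))
```

If losing the byte silently is a concern, "replace" or "backslashreplace" at least shows where the problem was.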

edited May 10 '19 at 11:44

answered May 10 '19 at 10:20

Atul Gopinathan


I tried what the comments suggested. The text editor says the encoding is UTF-8, but Python can't decode some of it because it's not in UTF-8. I also tried "save with encoding" with UTF-8, but it still doesn't work. – orksworms May 10 '19 at 11:04
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/22977262) – Cettt May 10 '19 at 11:27
@Cettt I see, I apologize for not doing that. As I said, my initial intent was to post a comment and not an answer but unfortunately I don't have enough reps to do that. I have improved my answer now as suggested. Thanks for the suggestion and I'll make sure not to repeat it :) – Atul Gopinathan May 10 '19 at 11:47
I wouldn't advocate the `errors='ignore'` parameter. It essentially means you lose data. Characters will be deleted from the input without a trace. It's only a last resort if you have broken input data. – lenz May 10 '19 at 13:37
@lenz True, but then what else can be done? Maybe manually find out the characters from the file and replace them or maybe develop a Python script just for that? – Atul Gopinathan May 10 '19 at 14:48
I suspect that there's still some encoding mismatch in the OP's setup, rather than the editor producing corrupt data. But it's hard to tell for sure, as the details are vague. – lenz May 10 '19 at 20:15
@orksworms If the editor says it is UTF-8, and you get that error, then you are not opening the file as UTF-8. On Windows, for example, UTF-8 is not the default encoding. – Mark Tolonen May 10 '19 at 23:08
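
A quick way to check the point about defaults is to ask Python which encoding it would normally use when open() is called without an encoding argument. This is only a sketch; the is_utf8_default name is made up for illustration.

```python
import locale

# The locale's preferred encoding, which open() normally falls back to
# when no encoding argument is given (often cp1252 on Western-European Windows).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical helper flag: True when that default is already UTF-8.
is_utf8_default = default_encoding.lower().replace("-", "") == "utf8"

print("Default text encoding:", default_encoding)
print("UTF-8 by default:", is_utf8_default)
```

If this does not report UTF-8, passing encoding="utf-8" to open() explicitly is the safer choice.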
