Why do I keep getting a UnicodeDecodeError in pandas read_csv() even though I specified the correct encoding parameter?

Asked Apr 04 '21 at 21:22

Active Apr 04 '21 at 21:22

Viewed 67 times

I downloaded the sentiment140 dataset and tried to open it with pd.read_csv(), but I got this error: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 232719-232720: invalid continuation byte

Then I checked the file's encoding with the Unix file command and passed 'utf-8' as the encoding parameter to read_csv(), but I still get the same error.
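One way to see what is actually at the failing position is to decode the raw bytes directly and catch the exception. The following is only a minimal sketch and not part of the original question; the helper name `first_bad_bytes` is hypothetical, and the sample bytes are made up for illustration.

```python
def first_bad_bytes(data: bytes, encoding: str = "utf-8", context: int = 20):
    """Return (offset, surrounding bytes) for the first undecodable span, or None.

    Hypothetical helper: it decodes the whole buffer at once, so the offset
    is relative to the start of `data`.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        lo = max(0, err.start - context)
        return err.start, data[lo:err.end + context]
    return None


# Made-up sample: 0xE9 is "é" in latin1/cp1252 but an incomplete sequence in UTF-8.
sample = b"abc \xe9t\xe9 def"
print(first_bad_bytes(sample))
```

For a real file, the same helper could be called on the file's contents read in binary mode, e.g. `open("data.csv", "rb").read()`, where `data.csv` is a placeholder path. Seeing the raw bytes next to readable text often makes it easier to guess which single-byte encoding produced them.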


i'mgnome


`utf-8` was the default, so don't expect specifying `utf-8` to fix it. It's not `utf-8`. Figure out what the real encoding is and specify that. – Mark Tolonen Apr 04 '21 at 21:44
@MarkTolonen thanks man, it seems that the actual encoding was **latin1** – i'mgnome Apr 04 '21 at 22:02
Maybe. `latin1` can decode any byte sequence without error, so the absence of an error doesn't mean it is correct. If the CSV was created on Windows, it is more likely `cp1252`, which has some extra code points like the euro sign and smart quotes that would be mis-decoded with `latin1`. – Mark Tolonen Apr 04 '21 at 22:09
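The difference between the two encodings mentioned in the comments can be illustrated with a short sketch. The byte string below is invented for the example and is not taken from the dataset.

```python
# Bytes 0x80-0x9F are where latin1 and cp1252 disagree.
raw = b"\x93quoted\x94 costs \x80 5"

# latin1 never fails: every byte maps straight to U+0000-U+00FF,
# so 0x93, 0x94 and 0x80 become invisible C1 control characters.
as_latin1 = raw.decode("latin1")

# cp1252 maps the same bytes to curly quotes and the euro sign.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

If the `cp1252` result looks right for the real data, passing `encoding="cp1252"` to `pd.read_csv()` would be the corresponding change. Unlike `latin1`, Python's `cp1252` codec rejects a few unassigned bytes (0x81, for example), so it can still raise a UnicodeDecodeError on some inputs.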


0 Answers