I downloaded the Sentiment140 dataset and tried to open it with `pd.read_csv()`, but got `UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 232719-232720: invalid continuation byte`.

I then checked the file's encoding with the Unix `file` command and passed `encoding='utf-8'` to `read_csv()`, but I still get the same error.

i'mgnome
  • `utf-8` was the default, so don't expect specifying `utf-8` to fix it. It's not `utf-8`. Figure out what the real encoding is and specify that. – Mark Tolonen Apr 04 '21 at 21:44
  • @MarkTolonen thanks man, it seems that the actual encoding was **latin1** – i'mgnome Apr 04 '21 at 22:02
  • Maybe. `latin1` can decode any byte sequence without error, so a clean read doesn't prove it is correct. If the CSV was created on Windows, it is more likely `cp1252`, which has extra code points such as the euro sign and smart quotes that `latin1` would mis-decode. – Mark Tolonen Apr 04 '21 at 22:09
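
To make the `latin1` vs `cp1252` point from the comments concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration and are not taken from the Sentiment140 file, so which encoding that file really uses remains an open question.

```python
# Sample bytes as a Windows tool might write them (made up for
# illustration, not taken from the Sentiment140 file).
raw = b"price: \x80 5, \x93quoted\x94"

# utf-8 rejects these bytes, the same class of error as in the question.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin1 maps every byte to a code point, so it never raises,
# but 0x80-0x9F come out as invisible C1 control characters.
as_latin1 = raw.decode("latin1")

# cp1252 maps the same bytes to the euro sign and curly quotes.
as_cp1252 = raw.decode("cp1252")

print(utf8_ok)
print(repr(as_latin1))
print(repr(as_cp1252))
```

With pandas, the corresponding call would be `pd.read_csv(path, encoding="cp1252")`, where `path` is a placeholder for wherever the CSV was saved. Whether `cp1252` is actually right for this dataset is an assumption; spot-checking rows that contain quotes or currency symbols is one way to tell.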
