
So I have a very specific problem. I am handling very big CSV files, and they are not always perfect. Even though a file is in fact encoded as UTF-8, it can contain invalid bytes that can't be decoded. These bytes throw `UnicodeDecodeError: 'charmap' codec can't decode byte 0x9e in position 427: character maps to <undefined>`.

What I want to do is simply ignore or replace the bytes that can't be decoded. I know that pandas 1.3 added a feature to ignore decoding errors (the `encoding_errors` parameter of `read_csv`), but I am on an older version of pandas and can't update in the near future. Is there another way to achieve the same result in older pandas versions?
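One workaround that should work on any pandas version is to do the decoding yourself before pandas sees the data: open the file in text mode with `errors="replace"` and pass the resulting file object to `read_csv`, which accepts file handles as well as paths. A minimal sketch, using an in-memory byte stream to simulate a file with an invalid UTF-8 byte (for a real file you would just use `open(path, encoding="utf-8", errors="replace")`):

```python
import io

import pandas as pd

# Simulated CSV bytes containing an invalid UTF-8 byte (0x80).
raw = b"name,value\nfoo,1\nb\x80r,2\n"

# TextIOWrapper decodes the byte stream with errors="replace", turning
# every undecodable byte into U+FFFD before pandas parses the text.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="replace")
df = pd.read_csv(text)

print(df["name"].tolist())  # the bad byte becomes the U+FFFD replacement char
```

Because the replacement happens in Python's own I/O layer rather than inside pandas, this does not depend on `encoding_errors` existing, so it works on 0.23 and 1.1 alike. Use `errors="ignore"` instead if you would rather drop the bad bytes entirely.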

  • Which version of pandas are you using? – Abdul Niyas P M Aug 26 '22 at 09:12
  • 0.23.3, but I could go up to 1.1 – Leonard Niedermayer Aug 26 '22 at 09:27
  • That's a strange error from UTF-8 decoding, where 9e would be the first byte of a multi-byte sequence. It is much more likely to come from trying to interpret some UTF-8 data as iso-8859-x, where 9e is indeed in an undefined range. Are you sure that you are trying to read the data as UTF-8? – Ture Pålsson Aug 26 '22 at 09:41
  • You are right, I posted the wrong error while testing with different encodings. `'utf-8' codec can't decode byte 0x80 in position 2: invalid start byte` is the error I get with UTF-8 – Leonard Niedermayer Aug 26 '22 at 09:44
  • Very likely, the problem is that your encoding is wrong in the first place. – tripleee Aug 26 '22 at 10:33
  • That might be the case. I need to guess the encoding because I don't get it from the file provider, and every encoding-guessing method I found is not 100% accurate, so I need a way to ignore charmap errors and drop the offending bytes. Sure, I could just use Latin-1 or something, but then the whole text would be mostly unreadable. – Leonard Niedermayer Aug 26 '22 at 10:53
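The trade-off discussed in the comments above comes down to Python's built-in codec error handlers: `"replace"` substitutes each undecodable byte with U+FFFD so you can see where the damage was, while `"ignore"` silently drops it; valid UTF-8 around the bad byte survives intact either way. A quick illustration on raw bytes:

```python
# "caf\xc3\xa9" is valid UTF-8 for "café"; the lone 0x80 byte is invalid.
bad = b"caf\xc3\xa9 \x80 ok"

# "replace" marks the bad byte, "ignore" drops it; the rest decodes normally.
print(bad.decode("utf-8", errors="replace"))  # caf\u00e9 \ufffd ok
print(bad.decode("utf-8", errors="ignore"))   # caf\u00e9  ok
```

This is also why guessing Latin-1 makes everything "readable" but garbled: Latin-1 maps every possible byte to some character, so nothing ever errors out, and genuine multi-byte UTF-8 sequences get decoded as two or three mojibake characters instead.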

0 Answers