
Pandas read_csv() raises a UnicodeDecodeError on some specific rows. If I use nrows=n1 it works without any error, but when I use nrows=n2 (> n1) it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 12: invalid start byte

It worked fine before, but at some point it started returning the error consistently. Sometimes it works again after I reboot the computer, but only the first time I call it.

I tried read_csv with and without the encoding option, and also tried error_bad_lines=False.

This is driving me crazy. Any ideas? If this is a system issue, I'd at least like to know how to get the row number of the problematic row.

(I exported the table from MATLAB with the encoding specified as utf-8; I also tried CP949, which is my system's default encoding. Importing from SAS was successful.)

Gonçalo Peres
crux26
  • Which encoding options did you try? You can try to let python detect the encoding, and provide that to `read_csv` as shown [here](https://stackoverflow.com/questions/33819557/unicodedecodeerror-utf-8-codec-while-reading-a-csv-file/33819765#33819765). – rinkert Oct 20 '19 at 08:41
  • I tried utf-8 and cp949, and also let Python determine it as you suggested. All failed miserably. Haven't tried chardet yet. Thanks for the suggestion! – crux26 Oct 20 '19 at 08:52
  • Use `chardet.detect`, or any text editor able to read your file and tell you what encoding it uses, or one of the many online tools that let you detect your encoding... – Thierry Lathuille Oct 20 '19 at 08:57
  • Don't be subtle: try `encoding='latin1'` in `read_csv` ;) – Quant Christo Oct 20 '19 at 09:23

1 Answer


When using pandas.read_csv there are various parameters one can pass. One of them is encoding, which controls how the file's bytes are translated into characters. If one is curious to know more about encodings, this can be a good place to start.

As there are a lot of encodings and I don't have access to OP's data, one might want to look at this page, which lists Python's standard encodings.
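Before guessing, one can also narrow down the encoding with the standard library alone: try decoding the raw bytes with a few likely candidates and keep the first that succeeds. This is a sketch; the candidate list below is an assumption and should be adjusted to one's locale (OP mentioned utf-8 and cp949).

```python
# Try a few candidate encodings until one decodes the whole file.
# The candidate list is an assumption; adjust it to your locale.
candidates = ['utf-8', 'cp949', 'iso-8859-1']

def detect_encoding(path, candidates):
    raw = open(path, 'rb').read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates decoded the file cleanly
```

Note that iso-8859-1 accepts every byte value, so it should go last: it will "succeed" on any file, even when the text comes out garbled.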

Then, assuming one's file is called data.csv, one would use it as follows:

import pandas as pd

pd.read_csv('data.csv', encoding='iso-8859-1')  # iso-8859-1 is for Western Europe

Again, the list of encodings is vast, so I recommend OP adjust it to the use case at hand.

From pandas version 1.3.0, the argument encoding_errors was added (see this PR). It controls how encoding errors are handled; see here for a list of possible values.

If one wants to replace the undecodable bytes, keeping the encoding used above, then the following should do the job:

import pandas as pd

pd.read_csv('data.csv', encoding='iso-8859-1', encoding_errors='replace') 
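As for OP's question about finding the row number of the problematic row: one can read the file in binary mode and decode each line separately, collecting the line numbers that fail. A stdlib-only sketch; the filename and encoding are placeholders to substitute with one's own.

```python
# Decode each line separately to locate the rows that break the given
# encoding. Returns 1-based line numbers of the offending rows.
def find_bad_rows(path, encoding='utf-8'):
    bad = []
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```

Once the offending rows are known, one can inspect or fix them in the source file, or confirm whether the bytes belong to another encoding such as cp949.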
Gonçalo Peres