I am having an issue with pandas read_csv. I have a file that contains " as a field value. That should not happen, but I have no control over how the file is generated, so I need a workaround.

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 15345

I found an issue report about this on Git (link here), where they suggest passing the delimiter used for the "sep" parameter as the "quotechar" as well. When I do that, the structure of the file gets messed up.

Another thing I tried was wrapping the read in exception handling, so that the code keeps running for the rest of the files, but I will keep hitting this issue for that particular type of file.

The command I use to read the CSV file:

df_new = pd.read_csv(file_path_name, sep=";", error_bad_lines=False)

Any idea for a workaround (e.g. ignoring the line with this issue)? One way, I guess, would be to use the csv library to remove that line (or replace " with something else), but I would like to keep it simple and do as much as possible within pandas.

Python version: 3.6.2

Pandas version: 0.21.0

Thank you and best regards

Bostjan
  • Can you post sample erroneous records from your csv file? I asked because I'm trying to replicate your issue but did not encounter it. – oim Dec 29 '17 at 13:17
  • `error_bad_lines=False` is a workaround. But a solution would involve seeing the row that is causing the loading to error out. – cs95 Dec 29 '17 at 13:23
  • Hello! I have tried this, and it only solves the case where a line has too many or too few fields; the file still is not read into a dataframe when the EOF issue occurs. @user8505495: here is a sample record that causes the issue (delimiter set to ;): ";";testmail@mail.com;PRI;BUS;0;1;0.00;;;;;ACTIVE;1;;;0;TRUE;GEO;FALSE;1 – Bostjan Dec 31 '17 at 00:29
  • Did you try what is described here (https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5), that is, adding quoting=csv.QUOTE_NONE? – oim Jan 01 '18 at 15:23
  • @user8505495, thank you for that, it did indeed solve my issue :) I had seen this setting before, but I had no idea what it does, so I did not even try it. One thing to mention for anyone who finds this useful: using quoting=3 in the to_csv command was causing errors for me (missing escape char), so I only used this option for read_csv. Thank you again for the help! – Bostjan Jan 02 '18 at 15:39
  • Possible duplicate of [Pandas ParserError EOF character when reading multiple csv files to HDF5](https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5) – IanS Nov 16 '18 at 10:31

1 Answer

I would just like to point out that the suggestion from @user8505495 worked (thank you again).

Basically, just add the parameter quoting=3 (that is, csv.QUOTE_NONE) to read_csv. Using the same parameter in to_csv caused an error (missing escape character). One option is to set the escapechar parameter; another is simply not to use the quoting parameter for to_csv.
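Here is a minimal sketch of the above, assuming a small made-up sample instead of the real file. The column names and values are hypothetical, and an in-memory string stands in for file_path_name. It reads a file where a lone " is a field value, then shows the two writing options mentioned above.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the first data row has a bare " as a field value.
sample = 'id;mark;email\n1;";a@example.com\n2;ok;b@example.com\n'

# quoting=csv.QUOTE_NONE (the same as quoting=3) makes the parser treat "
# as an ordinary character instead of the start of a quoted field.
df = pd.read_csv(io.StringIO(sample), sep=";", quoting=csv.QUOTE_NONE)

# Writing, option 1: leave quoting at its default, so the writer
# quotes the field that contains " by itself.
out_default = df.to_csv(sep=";", index=False)

# Writing, option 2: keep QUOTE_NONE but supply an escape character,
# so the writer has a way to escape special characters.
# The backslash is an assumption; any character not used in the data works.
out_escaped = df.to_csv(
    sep=";", index=False, quoting=csv.QUOTE_NONE, escapechar="\\"
)
```

Note that a file written with option 2 has to be read back with matching settings, whereas a file written with option 1 can be read with the read_csv defaults.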

Bostjan