
I have a large CSV file with 25 columns that I want to read as a pandas DataFrame. I am using `pandas.read_csv()`. The problem is that some rows have extra columns, something like this:

        col1   col2   stringColumn   ...   col25
1        12      1       str1                 3
...
33657    2       3       str4                 6       4    3 #<- that line has a problem
33658    1      32       blbla                 #<-some columns have missing data too 

When I try to read it, I get the error:

CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28

The problem does not happen if the extra values appear in the first rows. For example, if I add extra values to the third row of the same file, it parses fine:

#that example works: 
           col1   col2   stringColumn   ...   col25
    1        12      1       str1                 3
    2        12      1       str1                 3
    3        12      1       str1                 3       f    4
    ...
    33657    2       3       str4                 6       4    3 #<- that line has a problem
    33658    1      32       blbla                 #<-some columns have missing data too 

My guess is that pandas checks the first n rows to determine the number of columns, and if extra columns appear after that, it fails to parse them.

Skipping the offending lines as suggested here is not an option; those lines contain valuable information.

Does anybody know a way around this?

Sininho
  • How do you expect those lines to be stored and represented? For instance are you after 25 or 28 columns? – EdChum Apr 28 '16 at 13:14
  • I am only interested in about half of the 25 columns. The values in the extra column do not interest me. – Sininho Apr 28 '16 at 13:16
  • Are the number of columns fixed or not? i.e. is it always 28 – EdChum Apr 28 '16 at 13:17
  • Could you not just do `pd.read_csv(file_path, delim_whitespace=True, usecols=np.arange(25))`? – EdChum Apr 28 '16 at 13:18
  • I tried `data = pd.read_csv(f, skiprows=2, delimiter="\t", dtype='str', index_col=0, usecols=np.arange(25))` and got `ValueError: Usecols do not match names`. (The fields are separated by \t and there is one extra line at the beginning, hence the `skiprows` and `delimiter` arguments.) – Sininho Apr 28 '16 at 13:26

3 Answers


Since I did not find an answer that completely solves the problem, here is my workaround: I found out that explicitly passing the column names with the option `names=('col1', 'col2', 'stringColumn', ..., 'column25', '', '', '')` allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now. Combining the arguments `names` and `usecols` does not work; if somebody has another solution I would be happy to hear it.
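A minimal sketch of this workaround, assuming a tab-separated file with at most 28 fields per row and two leading lines to skip (as in my comment above); the exact column names here are placeholders:

import pandas as pd

# 25 real columns plus 3 placeholder names, so every row tokenizes
# to at most 28 fields (the names themselves are hypothetical)
col_names = ['col%d' % i for i in range(1, 26)] + ['pad1', 'pad2', 'pad3']

df = pd.read_csv('data.tsv', delimiter='\t', skiprows=2, dtype='str',
                 header=None, names=col_names)

# rows with only 25 fields get NaN in the padding columns,
# which can simply be dropped after the file is read
df = df.drop(['pad1', 'pad2', 'pad3'], axis=1)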

Sininho

In my initial post I mentioned not using `error_bad_lines=False` in `pandas.read_csv`. I decided that actually doing so is the more proper and elegant solution. I found this post quite useful:

Can I redirect the stdout in python into some sort of string buffer?

I added a little twist to the code shown in the answer.

import sys
import re
from io import StringIO  # use cStringIO.StringIO on Python 2
import pandas as pd

fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n'''  # bad data
fname = "fake.csv"

# pandas reports each skipped line on stderr, so capture it in a buffer
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()

# on pandas >= 1.3 use on_bad_lines='skip' instead of error_bad_lines=False
df1 = pd.read_csv(StringIO(fake_csv),
                  error_bad_lines=False)

sys.stderr = old_stderr
log = mystderr.getvalue()
isnum = re.compile(r"\d+")

# each warning reads like "Skipping line 4: expected 3 fields, saw 5",
# so findall() extracts the line number and both field counts
lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]

columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)

From there you can do anything you want with lines_skipped_log, such as output it to CSV or create a DataFrame, as shown below.
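For instance, turning the log into a DataFrame (continuing from the variables above):

log_df = pd.DataFrame(lines_skipped_log[1:], columns=lines_skipped_log[0])
print(log_df)
#   line_num flds_expct num_fields      file
# 0        4          3          5  fake.csv
# 1        6          3          5  fake.csv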

Perhaps you have a directory full of files. You can create a pandas DataFrame from each log and concatenate them, as in the sketch below. From there you will have a log of which rows were skipped, and for which files, at your fingertips (literally!).
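A sketch of that idea, assuming the capture recipe above is wrapped in a hypothetical helper `read_with_log()` and the directory path is made up:

import glob
import re
import sys
from io import StringIO

import pandas as pd

def read_with_log(path):
    # hypothetical wrapper around the stderr-capture recipe above:
    # returns the parsed DataFrame and a log of the skipped lines
    old_stderr = sys.stderr
    sys.stderr = buf = StringIO()
    try:
        df = pd.read_csv(path, error_bad_lines=False)
    finally:
        sys.stderr = old_stderr
    isnum = re.compile(r"\d+")
    rows = [isnum.findall(line) + [path]
            for line in buf.getvalue().split("\n") if isnum.search(line)]
    cols = ["line_num", "flds_expct", "num_fields", "file"]
    return df, pd.DataFrame(rows, columns=cols)

logs = [read_with_log(path)[1] for path in glob.glob("data/*.csv")]
skipped = pd.concat(logs, ignore_index=True)  # one log covering every file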

JDE876

A possible workaround is to specify the column names. Please refer to my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550
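A minimal sketch of specifying column names, assuming the widest row has at most 28 fields (the file name and delimiter are carried over from the question):

import pandas as pd

max_fields = 28  # assumption: the highest field count in the file
df = pd.read_csv('data.tsv', delimiter='\t', header=None, skiprows=2,
                 names=range(max_fields))

Shorter rows are padded with NaN, and the unwanted numeric columns can be dropped afterwards.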

computerist