I have a large CSV file with 25 columns that I want to read into a pandas DataFrame. I am using pandas.read_csv().
The problem is that some rows have extra columns, something like this:
col1 col2 stringColumn ... col25
1 12 1 str1 3
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
When I try to read it, I get this error:
CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28
The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it reads fine:
#that example works:
col1 col2 stringColumn ... col25
1 12 1 str1 3
2 12 1 str1 3
3 12 1 str1 3 f 4
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
My guess is that pandas checks the first (n) rows to determine the number of columns, and if a later row has more fields than that, parsing fails.
Skipping the offending lines, as suggested here, is not an option: those lines contain valuable information.
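To see which rows are wider or narrower than the header, independently of pandas, something like the following sketch could be used. It relies only on the standard-library csv module. The helper name `ragged_rows`, the comma delimiter, and the tiny three-column sample are assumptions made for illustration and are not taken from the real file.

```python
import csv
import io


def ragged_rows(lines, expected, delimiter=","):
    """Return (line_number, field_count) for every non-empty row
    whose number of fields differs from `expected`."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != expected
    ]


# Hypothetical miniature of the file: 3 header columns instead of 25.
sample = io.StringIO(
    "col1,col2,col3\n"
    "1,12,str1\n"
    "2,3,str4,6,4\n"   # extra fields
    "1,32\n"           # missing field
)

mismatches = ragged_rows(sample, expected=3)
print(mismatches)

# Width of the widest offending row, i.e. how many columns would be
# needed to hold every value without dropping anything.
widest = max(count for _, count in mismatches)
print(widest)
```

The `widest` value is the number of columns a reader would need in order to keep every field. pandas.read_csv does accept a `names` argument, so passing a list of at least that many names might be one way to keep the extra values, but I have not verified that on the full file.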
Does anybody know a way around this?