I'm trying to read segments of a CSV file into a pandas DataFrame, but I run into trouble once I set nrows beyond a certain point. The CSV file is split into segments with different headers and types of data, so I've gone through the file, found where each segment starts and ends, and saved those line numbers. When I try:
pd.io.parsers.read_csv('filename', skiprows=40, nrows=12646)
It works fine. Any more rows, and it throws an error:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71
It's true that line 13897 has 71 fields; that's exactly why I'm using skiprows and nrows, to stop reading before that line. I can find the last row that pandas will read, and it doesn't look any different from the rest. Looking at the file in a hex editor, I still don't see any difference.
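(For reference, a naive field count like the following is how I'd double-check a line. It assumes plain comma delimiters with no quoted or escaped commas, and the helper name is mine, not from pandas.)

```python
import io

def fields_on_line(lines, lineno):
    # Count comma-separated fields on a 1-based line number.
    # Quick check only: assumes no quoted or escaped commas.
    for i, line in enumerate(lines, start=1):
        if i == lineno:
            return len(line.rstrip('\r\n').split(','))

# On my file this would be: fields_on_line(open('filename'), 13897)
demo = fields_on_line(io.StringIO("a,b\n1,2,3\n"), 2)  # 3 fields
```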
I've also tried it with another CSV file, and I get similar results:
pd.io.parsers.read_csv('file2', skiprows=112, nrows=18524)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187
But:
pd.io.parsers.read_csv('file2', skiprows=112, nrows=18525)
gives:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71
Is there something I'm missing? Is there another way to do this?
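In case it helps, here's a minimal sketch of one workaround I'm considering: slicing the file into segments myself, so the parser never sees lines from another segment. The file contents and segment boundaries below are made up to illustrate the idea.

```python
import io
import pandas as pd

# Toy stand-in for my file: two segments with different headers and
# different field counts (contents and boundaries are made up).
text = "a,b\n1,2\n3,4\nx,y,z\n5,6,7\n6,7,8\n"
lines = io.StringIO(text).readlines()

# (start, stop) line ranges for each segment, as I'd record them by
# scanning the real file.
segments = [(0, 3), (3, 6)]

frames = []
for start, stop in segments:
    # Hand the parser only this segment's lines, so it never hits the
    # mismatched field counts of the segments that follow.
    frames.append(pd.read_csv(io.StringIO(''.join(lines[start:stop]))))
```

This avoids the tokenizer ever touching line 13897 at all, but it feels like skiprows/nrows should already be doing that for me.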
I'm using pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.