Reading parts of ~13000 row CSV file with pandas read_csv and nrows

Question

I'm trying to read segments of a CSV file into a pandas DataFrame, and I'm running into trouble when I set nrows above a certain value. My CSV file is split into segments with different headers and types of data, so I've gone through the file and recorded the line number where each segment starts. When I try to do:

pd.io.parsers.read_csv('filename',skiprows=40, nrows=12646)

It works fine. Any more rows, and it throws an error:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71

It's true that line 13897 has that many fields; that's why I'm using nrows and skiprows to stop before it. I can find the last row that pandas will read, and it doesn't look any different from the rest. Looking at the file in a hex editor, I still don't see any difference.

I've also tried it with another CSV file, and I get similar results:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18524)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187

But:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18525)

gives:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71

Is there something I'm missing? Is there another way to do this?

I'm using: pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.

Is there something fishy about that line, such as it having 70 commas where every previous line has 55? — Andy Hayden, Apr 05 '13 at 17:39
The line the error is referring to is one with 70 commas, yes. But with the skiprows and nrows, I'm trying to prevent it from reaching that line. For example, when the error refers to line 13897, I'm trying to read from lines 40 to 12647+40. The rows I'm trying to specify are normal (55 fields). — dooz, Apr 05 '13 at 17:42

score 3 · Answer 1 · answered Apr 05 '13 at 21:22

You can use warn_bad_lines and error_bad_lines to suppress the error and the warning for malformed lines:

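import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

# The third data line has 4 fields where the header declares 3.
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")

# Skip malformed lines silently instead of raising.
# These two flags were deprecated in pandas 1.3 and removed in 2.0;
# on those versions use on_bad_lines='skip' instead.
pd.read_csv(data, warn_bad_lines=False, error_bad_lines=False)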

HYRY

This seems to work, but I'm still wondering why it threw the error in the first place. I wrote a workaround using StringIO buffers in the meantime, but I don't understand why it gives an error for a line it isn't being told to read. pd.read_csv reads the same data fine (without the bad_lines flags) from a StringIO built from exactly the lines I'm selecting with nrows and skiprows. – dooz Apr 05 '13 at 22:52
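
The StringIO workaround mentioned in the comment above might look roughly like this. It is a minimal sketch, not the asker's actual code: slice_segment is a hypothetical helper, and it assumes that slicing by physical line is safe (no quoted fields containing newlines). The demo parses with the standard csv module so that it runs without pandas; the same buffer could be passed to pd.read_csv instead.

```python
import csv
from io import StringIO
from itertools import islice


def slice_segment(lines, skiprows, nrows):
    """Copy one segment (header line + nrows data lines) into a StringIO.

    `lines` is any iterable of text lines, e.g. an open file object.
    Hypothetical helper, not part of pandas.
    """
    # Drop the first `skiprows` lines, then keep the header and `nrows` rows.
    wanted = islice(lines, skiprows, skiprows + nrows + 1)
    return StringIO("".join(wanted))


# Tiny stand-in for the real file: a 3-column segment followed by a wider one.
sample = StringIO(
    "junk line\n"
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
    "w,x,y,z\n"
    "7,8,9,10\n"
)

segment = slice_segment(sample, skiprows=1, nrows=2)
rows = list(csv.reader(segment))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3'], ['4', '5', '6']]
```

With the real file, the call would be along the lines of slice_segment(open('filename', newline=''), 40, 12646), and the returned buffer would go straight to pd.read_csv. Because the parser only ever sees the selected lines, the wider line further down the file cannot trigger the tokenizing error.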
