Malformed CSV file and Pandas read_csv by chunk

Question

I have a CSV file: 22 GB in size, 46,000,000 lines. To save memory, the file is read and processed in chunks.

import pandas as pd

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"')
for chunk in tp:
    chunk  # placeholder: process the chunk here

But the file is malformed and reading it raises an exception:

Error tokenizing data. C error: Expected 87 fields in line 15092657, saw 162

Is there a way to discard this chunk and continue the loop with the next chunk?

Does it skip the bad line if you try this: `tp = pd.read_csv(f_in, sep=',', engine='c', chunksize=1000, encoding='utf-8', quotechar='"', error_bad_lines=False)`? – EdChum, Dec 17 '14 at 13:36

score 1 · Answer 1 · edited May 23 '17 at 11:52

This question is similar to an earlier one: Python Pandas Error tokenizing data

As the answers there point out, be aware that error_bad_lines=False drops the offending lines from the result. They suggest that a better approach is to investigate those lines in your dataset.
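To investigate the offending lines before deciding to drop them, one option is to scan the file once and report every row whose field count differs from the expected one. The following is only a minimal sketch using the standard library `csv` module rather than pandas. The helper name `find_malformed_rows` and the in-memory sample are made up for illustration, and the line numbers it reports may not match the ones pandas prints exactly (for example when quoted fields span several lines).

```python
import csv
import io


def find_malformed_rows(lines, expected_fields, delimiter=",", quotechar='"'):
    """Yield (line_number, field_count) for rows with an unexpected field count.

    `lines` is any iterable of text lines, such as an open file object.
    """
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of physical lines read so far
            yield reader.line_num, len(row)


# Small in-memory stand-in for the real file (hypothetical data):
# the third line has four fields instead of three.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
report = list(find_malformed_rows(sample, expected_fields=3))
print(report)
```

For the real file you would pass something like `open(f_in, newline='', encoding='utf-8')` and `expected_fields=87`. The scan is lazy, so it does not load the whole file into memory. Note that blank lines are reported too, since they parse as rows with zero fields.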

answered Dec 17 '14 at 18:00

kristofferandreasen

The OP is asking to skip the chunk, not specifically to investigate why the line fails – EdChum Dec 17 '14 at 18:02

score 1 · Answer 2 · answered Dec 17 '14 at 22:31

As EdChum says, the question was how to skip the chunk, and adding 'error_bad_lines=False' does the trick. Is there a way to intercept the trace that reports the bad lines and count the faulty lines?

seb835

From the docs: `warn_bad_lines : boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).` So if you set `error_bad_lines=False`, it should output a warning containing the line number; you then have to inspect each of these warnings. – EdChum Dec 18 '14 at 09:23

score 1 · Answer 3 · answered Dec 18 '14 at 15:58

To intercept the bad lines, I use the following code:

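import io
import sys

import pandas as pd

# somewhere to store the captured warnings
err = io.StringIO()
# save a reference to the real stderr so we can restore it later
oldstderr = sys.stderr
# point stderr at our StringIO instance
sys.stderr = err
try:
    tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"', error_bad_lines=False)
    for chunk in tp:
        chunk  # placeholder: process the chunk here
finally:
    # restore stderr even if parsing fails
    sys.stderr = oldstderr

# count the lines of captured stderr output (one warning line per skipped row)
skipped = err.getvalue().splitlines()
print(str(len(skipped)) + ' lines skipped.')
# print (or otherwise use) the captured warnings
print(err.getvalue())
err.close()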
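On newer pandas versions the same counting can be done without redirecting stderr. `error_bad_lines` and `warn_bad_lines` were deprecated in pandas 1.3 in favour of `on_bad_lines` (with `on_bad_lines='skip'` as the counterpart of `error_bad_lines=False`) and removed in 2.0. Since pandas 1.4, `on_bad_lines` also accepts a callable when `engine='python'`. The sketch below assumes pandas 1.4 or later. The callback name `record_bad_line` and the tiny in-memory sample are made up for illustration.

```python
import io

import pandas as pd

# Small in-memory stand-in for the large file (hypothetical data):
# the third data row has one field too many.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n")

bad_lines = []  # raw fields of every rejected row


def record_bad_line(fields):
    """Remember the malformed row, then tell pandas to drop it."""
    bad_lines.append(fields)
    return None  # returning None makes pandas ignore this row


reader = pd.read_csv(
    data,
    sep=",",
    quotechar='"',
    chunksize=2,
    engine="python",  # a callable on_bad_lines requires the python engine
    on_bad_lines=record_bad_line,
)

rows_kept = 0
for chunk in reader:
    rows_kept += len(chunk)  # placeholder: process the chunk here

print(str(len(bad_lines)) + " lines skipped, " + str(rows_kept) + " rows kept.")
```

Keep in mind that the python engine is generally slower than the C engine, which matters for a 22 GB file. If you only need to drop the rows and not collect them, `on_bad_lines='skip'` works with the default C engine.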
