1

I have a CSV file: 22 GB in size, 46,000,000 lines. To save memory, the file is read and processed in chunks.

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"')
for chunk in tp:
    pass  # process the chunk here

But the file is malformed and raises an exception:

Error tokenizing data. C error: Expected 87 fields in line 15092657, saw 162

Is there a way to discard this chunk and continue the loop with the next chunk?

seb835
  • Does it skip if you try this: `tp = pd.read_csv(f_in, sep=',', engine='c', chunksize=1000, encoding='utf-8', quotechar='"', error_bad_lines=False)`? – EdChum Dec 17 '14 at 13:36
  • Will give it a try and get back to you with the results. – seb835 Dec 17 '14 at 13:43

3 Answers

1

This question is similar to an earlier one: Python Pandas Error tokenizing data

As the answers there point out, be aware that error_bad_lines=False silently drops each offending line. A better approach is to investigate those lines in your dataset.
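One way to investigate is to scan the file with the standard-library csv module and report every line whose field count differs from the expected one. This is only a minimal sketch: the helper name find_malformed_rows and the in-memory sample are hypothetical, and the csv module may split some edge cases differently from the pandas C parser.

```python
import csv
import io

def find_malformed_rows(lines, expected_fields, delimiter=",", quotechar='"'):
    """Yield (line_number, field_count) for rows with an unexpected field count."""
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        # Blank lines come back as empty lists; ignore them here
        if row and len(row) != expected_fields:
            # reader.line_num is the physical line number of the last line read
            yield reader.line_num, len(row)

# Tiny in-memory sample standing in for the real file (made-up data)
sample = io.StringIO('a,b,c\n1,2,3\n4,5,6,7,8\n"x,y",9,10\n')
bad_rows = list(find_malformed_rows(sample, expected_fields=3))
print(bad_rows)
```

For the real file, pass an open handle instead of the sample, for example open(f_in, newline='', encoding='utf-8'), with expected_fields=87. The scan is lazy, so it does not load the whole file into memory.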

kristofferandreasen
1

As EdChum says, the question was how to skip the bad chunk, and adding error_bad_lines=False does the trick (it drops only the malformed lines, not the whole chunk). Is there a way to intercept the output that reports the bad lines and count the faulty lines?

seb835
  • From the docs: `warn_bad_lines : boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).` So if you set `error_bad_lines=False`, it should output a warning containing the line number; you then have to inspect each of these warnings. – EdChum Dec 18 '14 at 09:23
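For readers on newer pandas: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines. Since pandas 1.4, on_bad_lines also accepts a callable when engine='python', which allows counting bad lines without capturing warnings. The sketch below assumes pandas 1.4 or later; the name record_bad_line and the sample data are hypothetical.

```python
import io
import pandas as pd

bad_lines = []  # collects the fields of every row with too many fields

def record_bad_line(fields):
    """Remember the malformed row and return None so pandas drops it."""
    bad_lines.append(fields)
    return None

# Made-up sample: the third line has 5 fields instead of 3
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
reader = pd.read_csv(
    sample,
    sep=",",
    chunksize=2,
    engine="python",  # a callable on_bad_lines requires the Python engine
    on_bad_lines=record_bad_line,
)
total_rows = sum(len(chunk) for chunk in reader)
print(total_rows, "rows kept,", len(bad_lines), "bad lines")
```

The Python engine is slower than the C engine, which matters on a 22 GB file. on_bad_lines='warn' or on_bad_lines='skip' keep the C engine if only skipping is needed.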
1

To intercept the bad lines, I use the following code:

import io
import sys
import pandas as pd

# somewhere to store the parser warnings
err = io.StringIO()
# save a reference to the real stderr so we can restore it later
oldstderr = sys.stderr
# set stderr to our StringIO instance
sys.stderr = err
try:
    tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"', error_bad_lines=False)
    for chunk in tp:
        pass  # process the chunk here
finally:
    # restore stderr, even if parsing raises
    sys.stderr = oldstderr

# each skipped line produces one "Skipping line ..." warning
captured = err.getvalue()
print(str(captured.count('Skipping line')) + ' lines skipped.')
print(captured)
err.close()
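The captured text can also be parsed to recover the line numbers themselves. This is a rough sketch that assumes each warning has the form "Skipping line N: expected E fields, saw S"; check that against what your pandas version actually prints. The helper name parse_skipped and the sample text are hypothetical.

```python
import re

# Assumed warning format; adjust the pattern if your pandas version differs
WARNING_RE = re.compile(r"Skipping line (\d+): expected (\d+) fields, saw (\d+)")

def parse_skipped(captured):
    """Return (line_number, expected, seen) tuples found in the captured text."""
    return [tuple(int(g) for g in m.groups()) for m in WARNING_RE.finditer(captured)]

# Made-up sample standing in for err.getvalue()
captured = (
    "Skipping line 15092657: expected 87 fields, saw 162\n"
    "Skipping line 15092660: expected 87 fields, saw 90\n"
)
skipped = parse_skipped(captured)
print(len(skipped), "lines skipped:", [line for line, _, _ in skipped])
```

With the line numbers in hand, the offending rows can be pulled from the file and inspected individually.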
seb835