pandas.read_csv with error_bad_lines=False seems to duplicate rows

Asked May 23 '17 at 11:31

While importing a CSV file with

import pandas as pd
test_df = pd.read_csv('test.csv',sep='\t')

I encountered the error `Error tokenizing data. C error: Expected 2 fields in line 173840, saw 3`.

As suggested here I applied

test_df = pd.read_csv('test.csv',sep='\t', error_bad_lines=False)

Instead of just skipping the problematic row, it seems that it started copying again from a random line (89465 in this case).

Actual data in the original csv:

[image: actual data]

Data copied from the csv:

[image: weird error]

Do you have any idea why this is happening and what I could do to prevent it?
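One way to investigate is to locate the lines whose field count differs from the expected one, as in the tokenizer error above. The following is a minimal sketch using only the standard library; the helper name `find_bad_lines` and the in-memory sample are hypothetical stand-ins for the real `test.csv`, and it assumes quote characters should be treated as ordinary data.

```python
import csv
import io


def find_bad_lines(lines, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    # QUOTE_NONE treats quote characters as plain data, so every physical
    # line is parsed as exactly one row.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=1)
        if row and len(row) != expected_fields  # skip blank lines
    ]


# Hypothetical sample standing in for test.csv: line 3 has one field too many.
sample = io.StringIO("a\tb\n1\t2\n3\t4\t5\n6\t7\n")
bad = find_bad_lines(sample, expected_fields=2)
print(bad)
```

For a real file, the same helper could be called as `find_bad_lines(open('test.csv', newline=''), expected_fields=2)`, which lists the offending line numbers so they can be inspected in a text editor.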

Carlo

Simple check: what is the size of the df and of the file? – Roelant May 23 '17 at 11:37
You can also try to switch from the C to the Python engine, [`pd.read_csv('file.csv',engine='python')`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) – philshem May 23 '17 at 11:41
You can use `pd.read_table()` directly for tab-separated files. – Karan Chudasama May 23 '17 at 11:46
@Roelant the file has 2617762 rows and occupies 34MB, but I had no problem at all with a similar file 9 times bigger. – Carlo May 23 '17 at 12:11
@KaranChudasama thanks, that worked fine :) But most of all I was curious about the bizarre result. – Carlo May 23 '17 at 12:13
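
The size check and the Python engine suggested in the comments above can be combined into a small sanity test. This is only a sketch under stated assumptions: it uses a hypothetical in-memory sample instead of `test.csv`, and it assumes pandas 1.3 or later, where `on_bad_lines="skip"` replaces `error_bad_lines=False`. It does not reproduce or explain the duplication seen in the question; it only shows how one might verify whether skipped lines and duplicated rows account for the difference between the file and the DataFrame.

```python
import io

import pandas as pd

# Hypothetical sample standing in for test.csv: line 3 has one field too many.
raw = "a\tb\n1\t2\n3\t4\t5\n6\t7\n"

# Assumes pandas >= 1.3; on older versions the question's
# error_bad_lines=False plays the same role.
df = pd.read_csv(io.StringIO(raw), sep="\t", engine="python", on_bad_lines="skip")

data_lines = len(raw.splitlines()) - 1      # data lines in the file, header excluded
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row

print(len(df), data_lines, n_duplicates)
```

If `len(df)` is larger than the number of data lines, or `n_duplicates` is unexpectedly high for data that should be unique, the parser is likely producing rows that are not in the file.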

0 Answers