Update: added sample records that are being skipped.
I am working with a large CSV file (almost 60 GB, possibly more) in which every field is wrapped in double quotes. The problem is that one column contains double quotes as part of the data itself, so pandas read_csv fails on those records.
1. Is there a way to handle this without writing a parsing routine to clean up that column, for example a parameter in pandas read_csv that I may be overlooking?
2. If not, can those lines be ignored so that the rest of the file is still read?
I tried adding error_bad_lines=False for this, but it does not seem to work.
Here is my current code, using pandas read_csv to read the file:
import pandas as pd

# hdfsfile_root is defined elsewhere in the script
csv_file_path = hdfsfile_root + "HOUSTON.DAT.gz"
delivery2 = pd.read_csv(csv_file_path, sep=',', dtype=None, error_bad_lines=False)
The problem records are skipped with warnings like:

Skipping line 2346214: expected 42 fields, saw 43
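To clarify question 1, the kind of cleanup routine I would rather not have to write looks roughly like the sketch below. It streams the gzip file line by line (so the 60 GB never has to fit in memory), strips any quote that is not next to a comma or a line boundary, and only then hands the result to read_csv. The clean_path name and the regex are only illustrative, and this assumes the .gz file is readable as an ordinary path, exactly as in the read_csv call above; the stray quotes it targets are the ones visible in the sample records below.

import gzip
import re
import pandas as pd

# Hypothetical pre-cleaning pass: remove any double quote that is not adjacent
# to a comma or a line boundary (covers ""BONAVENTURE COMPANY, LLC" and SUSIE"S).
stray_quote = re.compile(r'(?<=[^,])"(?=[^,\r\n])')

csv_file_path = hdfsfile_root + "HOUSTON.DAT.gz"
clean_path = "HOUSTON.clean.csv"  # illustrative output path

with gzip.open(csv_file_path, "rt") as src, open(clean_path, "w") as dst:
    for line in src:  # stream line by line, never holding the whole file in memory
        dst.write(stray_quote.sub("", line))

delivery2 = pd.read_csv(clean_path, sep=",", dtype=None)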
Sample records that hit this problem (note the stray quote in ""BONAVENTURE COMPANY, LLC" on the first line and in SUSIE"S on the second):
"2014-10-10 00:00:00.0000000","7751","367","BOY1D","01","DRIVER","10.7786","10.9267","1","12345678",""BONAVENTURE COMPANY, LLC","W012","07","GROUND","03","00","DATA NOT AVAILABLE","00","DATA NOT AVAILABLE","@@","6000","BERRY BROOK","DR","HOUSTON","TX","77017","KB","SIG OBTAINED","@@","M7","RECEIVER","0","0","0","0","0","29.6709","-95.2479","POSITIVE","DATA NOT AVAILABLE","3.4496e+007","3.4496e+007"
"2014-10-10 00:00:00.0000000","7751","377","BOY1E","01","DRIVER","10.7786","10.9267","1","12345678","SUSIE"S SO 40 CONFECTIONS","W018","07","GROUND","03","00","DATA NOT AVAILABLE","00","DATA NOT AVAILABLE","@@","6000","BERRY BROOK","DR","HOUSTON","TX","77017","KB","SIG OBTAINED","@@","M7","RECEIVER","0","0","0","0","0","29.6709","-95.2479","POSITIVE","DATA NOT AVAILABLE","3.4496e+007","3.4496e+007"