
Updated: (added sample records that are being skipped)

I am working with a large CSV file (almost 60 GB or more) in which every field is wrapped in double quotes. The problem is that one column contains double quotes as part of its data, so pandas read_csv fails on those records.

1. Is there a way to handle this without coding a parsing routine to clean out that column, e.g. a read_csv parameter that I may be overlooking?
2. If not, can the bad records be ignored so that the rest of the file is read? I tried adding error_bad_lines=False, but it does not seem to work.

Here is my current code using pandas read_csv to read this file:

import pandas as pd

# hdfsfile_root is defined elsewhere
csv_file_path = hdfsfile_root + "HOUSTON.DAT.gz"
delivery2 = pd.read_csv(csv_file_path, sep=',', dtype=None, error_bad_lines=False)

Skipping line 2346214: expected 42 fields, saw 43
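
(Note: in pandas 1.3 and later, error_bad_lines is deprecated; the equivalent of the call above would be along these lines:)

delivery2 = pd.read_csv(csv_file_path, sep=',', dtype=None, on_bad_lines='skip')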

Sample records that are being skipped:

"2014-10-10 00:00:00.0000000","7751","367","BOY1D","01","DRIVER","10.7786","10.9267","1","12345678",""BONAVENTURE COMPANY, LLC","W012","07","GROUND","03","00","DATA NOT AVAILABLE","00","DATA NOT AVAILABLE","@@","6000","BERRY BROOK","DR","HOUSTON","TX","77017","KB","SIG OBTAINED","@@","M7","RECEIVER","0","0","0","0","0","29.6709","-95.2479","POSITIVE","DATA NOT AVAILABLE","3.4496e+007","3.4496e+007"

"2014-10-10 00:00:00.0000000","7751","377","BOY1E","01","DRIVER","10.7786","10.9267","1","12345678","SUSIE"S SO 40 CONFECTIONS","W018","07","GROUND","03","00","DATA NOT AVAILABLE","00","DATA NOT AVAILABLE","@@","6000","BERRY BROOK","DR","HOUSTON","TX","77017","KB","SIG OBTAINED","@@","M7","RECEIVER","0","0","0","0","0","29.6709","-95.2479","POSITIVE","DATA NOT AVAILABLE","3.4496e+007","3.4496e+007"
  • Can you show an example line? If it is only one column, you might want to do some preprocessing. – Keith Dec 01 '17 at 18:05
  • @Keith, this is partial, but you can see the first column here has a double quote as part of the data, and it shows up in different places. If I have to preprocess, any idea how I could regex it to find and remove it? `"SUSIE"S SO 40 CONFECTIONS","6580E3 ","07","GROUND","03","00","DATA NOT AVAILABLE","00","DATA NOT AVAILABLE","@@","75","SOUTHBELT INDUSTRIAL","DR","HOUSTON","TX","77047","KB","SIG OBTAINED","@@","M7","RECEIVER","0","0","0","0","0","29.5878","-95.4242","POSITIVE","DATA NOT AVAILABLE","3.45533e+007","3.45533e+007"` – E B Dec 01 '17 at 18:34
  • @e-b, the sample you added in the comments is not really readable. Would you mind putting it in the question itself? – LeoRochael Dec 01 '17 at 18:52
  • This might help: https://stackoverflow.com/questions/35686920/reading-csv-from-pandas-having-both-quotechar-and-delimiter-for-a-column-value (a sketch along those lines follows this thread). You may even get lucky and have quotechar='"' work. – Keith Dec 01 '17 at 19:00
  • @Keith, tried both, but it still does not recognize that the quote is part of the data and differentiate it from the delimiting double quotes that start and end a column. – E B Dec 01 '17 at 19:08
  • Sorry, I missed the extra single quote that surrounds the whole line. Try sep=',' with quotechar='"', or alternatively sep='","' with lineterminator='"'. If none of this works, go through all lines and remove your ' and " characters: https://stackoverflow.com/questions/22770959/removing-characters-from-a-txt-file-using-python – Keith Dec 01 '17 at 19:28
  • @Keith, if I remove the ' and ", wouldn't that break the comma delimiting, since it is a comma-delimited file? Let me play around with cleaning/preprocessing. – E B Dec 01 '17 at 20:02
  • Leave the commas... just take out the apostrophes and quotation marks. You do not need the quotation marks for the import to recognize your strings. In fact this will help, since your number types would be recognized. The only issue is that SUSIE'S would appear as SUSIES, which may or may not matter for your work. – Keith Dec 01 '17 at 20:07
  • @Keith, that would not work if a name appears like "SUSIE, LLC"; that would make SUSIE one column and LLC another, which also breaks the row. – E B Dec 01 '17 at 20:10
  • Hmm, this is an annoying problem; it almost seems designed to be annoying. Can you change how the CSV is created so it is not so problematic? I do not think I can be more help. I would suggest trying some combination of preprocessing and read_csv, exploiting the sep, quotechar, and lineterminator parameters. It is going to be a bunch of trial and error. Make a subsample of rows that contains all the problem cases to speed up iteration time. – Keith Dec 01 '17 at 21:29
  • 1
    @Keith .. i would have to go back to the application that ETL this file.. thanks for the feedback – E B Dec 18 '17 at 22:25
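
For reference, a minimal sketch of the approach from the question Keith linked above: split on the three-character sequence '","' rather than on bare commas, so embedded quotes and embedded commas never act as delimiters. engine='python' is required for a multi-character separator; header=None is an assumption (drop it if the file has a header row), and the leftover delimiter quotes on the first and last columns are stripped afterwards.

import pandas as pd

# Split on the '","' sequence instead of bare commas; stray quotes and
# commas inside a field then never act as field separators.
df = pd.read_csv(
    "HOUSTON.DAT.gz",
    sep='","',
    engine='python',  # needed for a multi-character separator
    header=None,      # assumption: the file has no header row
)

# Only the first and last columns still carry a delimiter quote.
df.iloc[:, 0] = df.iloc[:, 0].str.lstrip('"')
df.iloc[:, -1] = df.iloc[:, -1].str.rstrip('"')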

0 Answers