How to identify bad records during read_csv using Python pandas?

Question

I have a data stored as a csv file with 250 plus columns and 700K records. I am encountering the parse error during read option. My objective is not resolve it but to debug/identify those error causing records

I already referred the posts here, here, here. So, it's not a duplicate

When I try the code below, I get the parse error as given below

df1 = pd.read_csv('New__Document.csv',low_memory=False)

ParserError: Error tokenizing data. C error: Expected 258 fields in line 14, saw 263

Based on this post, I followed the below suggestion and it works fine

 df = pd.read_csv('New__Document.csv',low_memory=False,on_bad_lines='skip')
 len(df)  # returns 365902 records

However, this results in loss of records. Therefore, I tried another suggestion

df1 = pd.read_csv('New__Document.csv',low_memory=False, sep='\t')
len(df1) # returns 762521 records.

But this doesn't display the output in tabular format (due to tab delimiter).

I would like to view the offending/bad records (obtained by subtracting = 396,619 records) in a neat tabular format. Because more than half of the dataset is lost due to this issue.

So, am seeking your help to know what is causing this issue? If I can identify those error causing records and store it in a table format, it would be helpful for me to review

score 1 · Answer 1 · answered Dec 16 '21 at 16:59

I suggest you write a script to first fix the faulty rows. You are first going to need to understand what is wrong with the invalid rows. First use the following script to display which rows do not contain the same number of rows as the header:

import csv

with open('New_Document.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    expected = len(header)
    
    for line_number, row in enumerate(csv_input, start=2):
        if len(row) != expected:
            print(line_number, row)

You then need to decide how you want to fix these rows, e.g. supply defaults for missing values. The following type of script could then be used to fix the faulty rows:

import csv

with open('New_Document.csv') as f_input, open('fixed.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    expected = len(header)
    csv_output.writerow(header)
    
    for line_number, row in enumerate(csv_input, start=2):
        if len(row) != expected:
            # Add logic here to fix missing entries, this pads the row with zeros
            row.extend([0] * (expected - len(row)))
            print(f"Fixed line {line_number}")
    
        csv_output.writerow(row)

fixed.csv should then load correctly using Pandas

How to identify bad records during read_csv using Python pandas?

1 Answers1