
Every day I need to ingest a CSV file into a pandas dataframe. The CSV has several thousand rows, but each day a few records arrive with more columns than expected. Let me give you an example. Take the following CSV:

Name, Address, Phone
John Doe, 777 Street, 3145678777
Jane Doe, 888 Street, 3145678888
Chris Doe, Adam Smith, 999 Street, 3145679999
Ellen Page, 222 Street, 3145679222

This is my import line:

df = pd.read_csv('myfile.csv', header=0, names=['Name', 'Address', 'Phone'])

As expected, row number 3 (Chris Doe, Adam Smith, 999 Street, 3145679999) breaks the process for it has an extra column. It seems that the application my source data is coming from allows users to enter commas in the Name field and, sometimes, they do so when multiple users share the same household. I cannot change the application.

What I am aiming at is simply detecting those rows and moving them to a separate text file or dataframe, anything that makes sense. I can deal with those records separately, that is fine.

I have seen some posts aiming at processing CSVs with a varying number of columns. I reckon that can complicate my process and, in my case, it is not worth it.

I would appreciate any help on how to achieve this in the most straightforward and simple way possible.

Thanks!

Wilmar

4 Answers


If you are okay with processing the bad records later, you can use the error_bad_lines and warn_bad_lines options while reading the CSV file and save the row numbers of the skipped records to a log file like this:

import contextlib
import pandas as pd

# pandas' C parser prints a "Skipping line N: expected 3 fields, saw 4"
# warning to stderr for every bad line, so redirecting stderr captures
# the offending line numbers in the log file
with open('bad_lines.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        df = pd.read_csv('output.csv', warn_bad_lines=True, error_bad_lines=False)

The above code will skip all bad lines and redirect the warnings (with their line numbers) to the log file, which you can then use for reprocessing. Let me know if that helped.
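Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0 in favour of on_bad_lines. A minimal sketch of the same idea on a recent version, assuming pandas >= 1.4 so that on_bad_lines can take a callable, would be:

import pandas as pd

bad_rows = []  # collects the parsed fields of every malformed line

def capture(bad_line):
    # pandas calls this once per line with too many fields;
    # returning None tells the parser to skip that line
    bad_rows.append(bad_line)
    return None

# a callable for on_bad_lines requires pandas >= 1.4 and the python engine
df = pd.read_csv('output.csv', on_bad_lines=capture, engine='python')

bad_rows then holds the split fields of each skipped record, ready to be written to a separate file or turned into its own dataframe.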

Edit:

Here is a slightly hacky solution if you don't want to reprocess the bad records. The idea is to read the three columns of the CSV as a single column by using a separator that never appears in the data. Then, for every row where the number of comma-separated elements is greater than the number of columns (3 in this case), keep the last two values as they are and concatenate everything before them into one element. That way, no matter how many commas the Name field contains, it should work:

df = pd.read_csv('output.csv', sep=';') # Notice the sep here

col_count = 3

def str_check(x):
    x = x.split(',')
    if len(x) > col_count:
        x = [', '.join(x[:-(col_count-1)])] + x[-(col_count-1):] 
        # Here the col_count is 3 so if you hardcode the values, 
        # it should look like [', '.join(x[:-2])] + x[-2:]
        # Join everything before the last two elements as one element
    
    return ';'.join(x)

df['Name, Address, Phone'] = df['Name, Address, Phone'].apply(str_check)
df = df['Name, Address, Phone'].str.split(';', expand=True)
df.columns = ['Name', 'Address', 'Phone']
df
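Alternatively, if you do want the bad records routed to a separate file up front rather than fixed in place, a minimal sketch using the standard csv module (again assuming the file is called output.csv) could look like this:

import csv
import pandas as pd

good, bad = [], []

with open('output.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)                      # ['Name', 'Address', 'Phone']
    for row in reader:
        (good if len(row) == len(header) else bad).append(row)

df = pd.DataFrame(good, columns=header)        # clean records only

# keep the offending rows in a separate file for later reprocessing
with open('bad_lines.csv', 'w', newline='') as f:
    csv.writer(f).writerows(bad)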
Aditya

$ awk -F, 'NF>3' file
Chris Doe, Adam Smith, 999 Street, 3145679999

$ awk -F, '{print > (NF>3 ? "bad" : "good")}' file

$ head bad good
==> bad <==
Chris Doe, Adam Smith, 999 Street, 3145679999

==> good <==
Name, Address, Phone
John Doe, 777 Street, 3145678777
Jane Doe, 888 Street, 3145678888
Ellen Page, 222 Street, 3145679222

or you could just fix your CSV to quote the "Name" field if it contains commas:

$ awk -F, 'NF>3{rest=$(NF-1) FS $NF; sub(/(,[^,]*){2}$/,""); $0="\"" $0 "\"," rest} 1' file
Name, Address, Phone
John Doe, 777 Street, 3145678777
Jane Doe, 888 Street, 3145678888
"Chris Doe, Adam Smith", 999 Street, 3145679999
Ellen Page, 222 Street, 3145679222

or always:

$ awk -F, '{rest=$(NF-1) FS $NF; sub(/(,[^,]*){2}$/,""); $0="\"" $0 "\"," rest} 1' file
"Name", Address, Phone
"John Doe", 777 Street, 3145678777
"Jane Doe", 888 Street, 3145678888
"Chris Doe, Adam Smith", 999 Street, 3145679999
"Ellen Page", 222 Street, 3145679222

If you're using a very old, non-POSIX-compliant awk that doesn't support regexp intervals then you can use sub(/,[^,]*,[^,]*$/,"") instead of sub(/(,[^,]*){2}$/,"").

See What's the most robust way to efficiently parse CSV using awk? for more info on manipulating CSVs with awk.
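If you would rather apply that last fix in Python before handing the file to pandas, a rough equivalent using the standard csv module (writing to a hypothetical fixed.csv) could be:

import csv

# split each line from the right so the last two fields survive intact;
# csv.writer then quotes the Name field whenever it contains a comma
with open('file') as src, open('fixed.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        name, address, phone = line.rstrip('\n').rsplit(',', 2)
        writer.writerow([name, address.strip(), phone.strip()])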

Ed Morton


Here is one approach: iterate over the lines in the CSV file and use a regular expression to find the field values corresponding to the header names Name, Address and Phone.

import re
import pandas as pd

data = []
# the greedy first group keeps any extra commas inside the Name field,
# leaving the last two comma-separated values for Address and Phone
regex = re.compile(r'(.*),\s?(.*),\s?(.*)')
with open('data.csv') as file:
    for line in file:
        v = regex.search(line)
        data += [v.groups() if v else []]

df = pd.DataFrame(data[1:], columns=data[0])

>>> df
                    Name     Address       Phone
0               John Doe  777 Street  3145678777
1               Jane Doe  888 Street  3145678888
2  Chris Doe, Adam Smith  999 Street  3145679999
3             Ellen Page  222 Street  3145679222
Shubham Sharma


I would handle this in one of two ways:

1 - Do some pre-processing before the Python ingestion to separate the records based on the number of columns (reflected by the separator count) and then process each set separately in Python:

## separate csv based on number of columns per row:
awk 'BEGIN{FS=","; OFS=","} {filename="outputfile_"NF"columns.csv"; print $0 > filename}' inputfile.csv

The above one-liner takes an input CSV, counts the number of columns in each row and directs each record to a file named "outputfile_Xcolumns.csv", which you can then process in Python.

2 - Or change your Python code so that, instead of loading directly into a dataframe with pandas, it reads the file line by line, splits each line into a list, and appends the list to a different dataframe based on its length, for example as sketched below.
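A minimal sketch of that second approach, assuming the input is a comma-separated inputfile.csv with a header row, could look like this:

import csv
from collections import defaultdict

import pandas as pd

rows_by_width = defaultdict(list)              # field count -> list of rows

with open('inputfile.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)                      # ['Name', 'Address', 'Phone']
    for row in reader:
        rows_by_width[len(row)].append(row)

# one dataframe per column count; the 3-column frame is the clean one
frames = {width: pd.DataFrame(rows) for width, rows in rows_by_width.items()}
frames[len(header)].columns = header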

I prefer method 1 since I think it would be more efficient.

dali