0

An extension to Problems reading CSV file with commas and characters in pandas

Solution provided in the above link works if and only if one column which contains commas as its characters and rest of the columns are well behaved.

What if there are more than one column with above issue?

Example CSV content with additional commas issue :

Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,NewYork, NY,8978974040,9991111234,Ph.D
Mukesh,26,Dallas,4547892345,Ph.D

Required Output Pandas DataFrame:

Name    Age  Address      Phone                  Qualification
Suresh  28   Texas        3334567892             B.Tech
Ramesh  24   NewYork, NY  8978974040,9991111234  Ph.D
Mukesh  26   Dallas       4547892345             Ph.D

Edited :

Input file with commas as characters in successive columns :

Name,Age,Address,Qualification,Grade                  
Suresh,28,Texas,B.Tech,Ph.D,A
Ramesh,24,NewYork, NY,B.Tech,A+
Mukesh,26,Dallas,B.Tech,Ph.D,A

Required Output Pandas DataFrame:

Name    Age  Address      Qualification Grade                  
Suresh  28   Texas        B.Tech,Ph.D   A
Ramesh  24   NewYork, NY  B.Tech        A+
Mukesh  26   Dallas       B.Tech,Ph.D   A

Can I get any suggestions to solve this issue?

Thanks in Advance!!!

Avinash Clinton
  • 543
  • 1
  • 8
  • 19

2 Answers2

3

One way to do this would be to have " to clearly separate your data -

Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,"NewYork, NY","8978974040,9991111234",Ph.D
Mukesh,26,Dallas,4547892345,Ph.D

If this isn't there, pandas will struggle to read it right.

Copy the above data, do a pd.read_clipboard(sep=',') and it will yield -

     Name  Age      Address                  Phone Qualification
0  Suresh   28        Texas             3334567892        B.Tech
1  Ramesh   24  NewYork, NY  8978974040,9991111234          Ph.D
2  Mukesh   26       Dallas             4547892345          Ph.D

If modifying the source data as a whole is not within your means-

A practical approach would be to do a usual read_csv with error_bad_lines=False. Once done, look through the logs and make a note of the lines that pandas is struggling to read and modify only those lines accordingly.

Hope this helps.

Vivek Kalyanarangan
  • 8,951
  • 1
  • 23
  • 42
2

Your data appears fixed for the first two columns and also the last, so these can be removed and the remaining values could be processed using itertools.groupby() to group the remaining columns into numeric or non-numeric groups. The resulting data could then be loaded into pandas:

import pandas as pd
from itertools import groupby
import csv

data = []

with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)

    for row in csv_input:
        addr_phone = [','.join(g) for k, g in groupby(row[2:-1], lambda x: x.isdigit())]
        data.append(row[:2] + addr_phone + [row[-1]])

df = pd.DataFrame(data, columns=header)        
print(df)

Giving you:

     Name Age      Address                  Phone Qualification
0  Suresh  28        Texas             3334567892        B.Tech
1  Ramesh  24  NewYork, NY  8978974040,9991111234          Ph.D
2  Mukesh  26       Dallas             4547892345          Ph.D

To work with your second example, you would have to decide on a way to split the two columns. I would suggest you create a list of possible qualifications. When there is a match, you would be able to split at that point. For example:

import pandas as pd
import csv

def find_split(data):
    for index, v in enumerate(data):
        if v.lower() in ['b.tech', 'ph.d']:
            return [', '.join(data[:index]), ', '.join(data[index:])]
    return [', '.join(data), '']

data = []

with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input, skipinitialspace=True)
    header = next(csv_input)

    for row in csv_input:
        data.append(row[:2] + find_split(row[2:-1]) + [row[-1]])

df = pd.DataFrame(data, columns=header)        
print(df)

Giving you:

     Name Age      Address Qualification Grade
0  Suresh  28        Texas  B.Tech, Ph.D     A
1  Ramesh  24  NewYork, NY        B.Tech    A+
2  Mukesh  26       Dallas  B.Tech, Ph.D     A

You could create a list of qualifications by first creating a set() based on the contents of row[2] (lowercased). Print the contents of the set and then add that to the script and rerun it.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • That works great!!! But what if I have a column(similar to address column with commas as it's characters in some lines) next to address column instead of Phone column??I mean if two non numeric columns with same issue are present in the data set next to each other, then It will be difficult to differenciate them using above solution right? – Avinash Clinton May 14 '18 at 12:04
  • Please [edit] your question to include another example showing this. It is though not possible to guess which data belongs to which column in a completely generic way. The best solution would be to ensure the data is correctly escaped when it is created in the first place. Perhaps you could show how the data is being created? – Martin Evans May 14 '18 at 12:18
  • The input data is provided by some third party source, so it's not possible to escape it while creation and this behavior is very rare, so there should be a way to handle it with some custom code snippet, since I cannot ignore these lines. – Avinash Clinton May 16 '18 at 05:59
  • You are probably going to need a different approach for each type of file. I have updated the answer to solve your 2nd example. – Martin Evans May 16 '18 at 10:59