I am running the following code in Python, using pandas, to read through various CSV files:

import pandas as pd

def create_formatted(csv_files):
    for f in csv_files:
        df = pd.read_csv(f)

However, the header of each file can start on a different row: every file has an initial 3-5 rows containing just a single column of data before the header starts, at which point the number of fields increases from 1 to 13. An example would be:

CSV File 1:

| Row 1 |
| Row 2 |
| Row 3 |
| Header A | Header B | Header C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | Cell 6 |


CSV File 2:

| Row 1 |
| Row 2 |
| Row 3 |
| Row 4 |
| Header A | Header B | Header C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | Cell 6 |

I have tried using the skiprows= parameter, but a fixed value obviously won't work, as the header can start on a different row in each CSV file.

def create_formatted(csv_files):
    for f in csv_files:
        df = pd.read_csv(f, skiprows=3)

Is there another workaround I can try to ignore the first few lines until the header row starts?

  • does this help? https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data – Olasimbo Feb 22 '23 at 08:53
  • Thank you @OlasimboArigbabu. I have tried error_bad_lines=False, and it ignores the lines I want it to read (from the header row onwards) and returns the first lines I want it to ignore (everything before the header row). – Matthew Clifford Feb 22 '23 at 09:08

1 Answer

I see two alternatives, depending on where your data comes from and how you want to use it. It's important to understand how CSV (comma-separated values) files work. The error you are getting occurs because your data looks like this:

1\n
2\n
3\n
4\n
5\n
6\n
7\n
8\n
A,1,10\n
B,2,11\n
C,3,12\n
A,4,13\n
B,5,14\n
C,6,15\n
A,7,16\n
B,8,17\n
C,9,18\n
A,10,19\n
B,11,20\n
C,12,21\n
A,13,22\n
B,14,23\n
C,15,24\n
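
For reference, reading a file shaped like that as-is reproduces the tokenizing error from the linked question. A minimal demo (my own sketch, using io.StringIO to stand in for the file):

import io
import pandas as pd

raw = "1\n2\n3\nA,1,10\nB,2,11\nC,3,12\n"

# pandas infers one column from the first line, then hits three fields on line 4
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as err:
    print(err)  # e.g. "Expected 1 fields in line 4, saw 3"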

So pandas starts off on line one thinking that the file will only contain one column, since it deduces the column count by doing something like len(first_row.split(',')). So either you clean up your data to look like this:

1,,\n
2,,\n
3,,\n
4,,\n
5,,\n
6,,\n
7,,\n
8,,\n
A,1,10\n
B,2,11\n
C,3,12\n
A,4,13\n
B,5,14\n
C,6,15\n
A,7,16\n
B,8,17\n
C,9,18\n
A,10,19\n
B,11,20\n
C,12,21\n
A,13,22\n
B,14,23\n
C,15,24\n

Or you write some kind of workaround method like this:

import io
import pandas as pd

def preParse(csv_as_txt, delimiter):
    # keep only the rows that have the full column count
    col_count = max(len(row.split(delimiter)) for row in csv_as_txt.split('\n'))
    cleaned = [row for row in csv_as_txt.split('\n') if len(row.split(delimiter)) == col_count]
    return '\n'.join(cleaned)

def create_formatted(csv_files):
    for f in csv_files:
        # clean the raw text first, then let pandas parse the result
        with open(f, 'r') as fh:
            cleaned = preParse(fh.read(), ',')
        df = pd.read_csv(io.StringIO(cleaned))
        print(df)


create_formatted(['test.csv'])

In my opinion it is better to clean up the data with the first alternative, since in the second case you will lose every row that is not as long as the longest one. If your data is very dirty, you could lose a lot.
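
For completeness, the per-file skiprows value the asker was looking for can also be computed on the fly: scan each file for the first line with the full field count and pass that index to read_csv. This is just a sketch of mine, not part of the original answer; find_header_row is a hypothetical helper, and the 13-field count comes from the question.

import pandas as pd

def find_header_row(path, delimiter=',', n_fields=13):
    # hypothetical helper: return the index of the first line
    # that has the full field count (13 per the question)
    with open(path, 'r') as fh:
        for i, line in enumerate(fh):
            if len(line.rstrip('\n').split(delimiter)) == n_fields:
                return i
    raise ValueError(f'no {n_fields}-field header found in {path}')

def create_formatted(csv_files):
    for f in csv_files:
        # skip everything before the detected header row
        df = pd.read_csv(f, skiprows=find_header_row(f))
        print(df)

This keeps the single-column preamble rows out entirely, instead of padding or dropping them.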

I hope this helps someone!

Olivier Neve