1

I have a CSV file that contains multiple tables. Each table has a title, and variable number of rows and columns (these numbers may vary between files). The titles, as well as names of rows and columns may also change between different files I will need to parse in the future, so I cannot hardcode them. some columns may contain empty cells as well.

Here is a screenshot of an example CSV file with this structure: enter image description here

I need to find a solution that will parse all the tables from the CSV into Pandas DFs. Ideally the final output would be an Excel file, where each table is saved as a sheet, and the name of each sheet will be the corresponding table title.

I tried the suggested solution in this post but it kept failing in identifying the start/end of the tables. When I used a simpler version of the input csv file, the suggested code only returned one table.

I would appreciate any assistance!!

Adi Lavy
  • 45
  • 9
  • 1
    If tables are separated by an empty row, you can just check whether the row is empty for every row you process. Once a row is empty, you can then start processing the next table – Alexandru Cristiean Jun 28 '21 at 19:01
  • Thank you @AlexandruCristiean! Could you please share a basic code example to point me in the right direction? – Adi Lavy Jun 28 '21 at 23:30

1 Answers1

1

You could try this:

df = pd.read_csv("file.csv")

dfs = []
start = 0
for i, row in df.iterrows():
    if all(row.isna()):  # Empty row
        # Remove empty columns
        temp_df = df.loc[start:i, :].dropna(how="all", axis=1)
        if start:  # Grab header, except for first df
            new_header = temp_df.iloc[0]
            temp_df = temp_df[1:]
            temp_df.columns = new_header
        temp_df = temp_df.dropna(how="all", axis=0)
        dfs.append(temp_df)
        start = i + 1

Then, you can reach each df by calling dfs[0], dfs[1], ...

Laurent
  • 12,287
  • 7
  • 21
  • 37
  • Thanks @Laurent! a couple of issues are that : 1. The code won't read the last table in the csv file. and 2. it adds an extra blank row to each table. – Adi Lavy Jun 29 '21 at 16:04
  • Thanks again! removing the extra blank row worked perfectly! However, I couldn't get the last table to parse. I checked and there's no extra blank row. Interestingly, if I add a blank row and a dummy data row after the last table, the script will read that table and ignore the dummy data row. – Adi Lavy Jun 29 '21 at 16:26
  • I was also wondering, why did you add the part that discards the first table? is that necessary? How should the code be modified to also parse the first table? – Adi Lavy Jun 29 '21 at 18:51
  • When I try it in its current form, the code doesn't t save the first df to the dfs list. – Adi Lavy Jun 29 '21 at 21:37
  • 1
    Indeed, sorry, `dfs.append(temp_df)` was not properly indented, see my updated answer. – Laurent Jun 30 '21 at 05:55
  • Thank you @Laurent!! for some strange reason it keeps skipping the last table in the CSV file... I'll keep digging. – Adi Lavy Jul 01 '21 at 17:35
  • @AdiLavy, did you find any solutions why it keeps skiping the last table? – Shihab Ullah Aug 05 '22 at 09:25
  • I did not... sorry – Adi Lavy Dec 20 '22 at 17:19