Split CSV file which contains multiple tables into different pandas dataFrames (Python)

Question

I have several CSV files, each containing multiple tables separated by blank lines.

Example:

Technology  C_inv [MCHF/y]  C_maint [MCHF/y]
NUCLEAR 70.308020   33.374568
HYDRO_DAM_EXISTING  0.000000    195.051200
HYDRO_DAM   67.717942   1.271600
HYDRO_RIVER_EXISTING    0.000000    204.820000
IND_BOILER_OIL  2.053610    0.532362
IND_BOILER_COAL 4.179935    1.081855
IND_BOILER_WASTE    11.010126   2.849652
DEC_HP_ELEC 554.174644  320.791276
DEC_THERMAL_HP_GAS  77.077291   33.717477
DEC_BOILER_GAS  105.586089  41.161335
DEC_BOILER_OIL  33.514266   25.948450
H2_FROM_GAS 145.185290  59.178082
PYROLYSIS   132.200818  112.392123

Storage technology  C_inv [MCHF/y]  C_maint [MCHF/y]
HYDRO_STORAGE   0.000000    0.000000

Resource    C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE    702.000000
DIESEL  96.390000
OIL 267.787558
NG  1648.527242
WOOD    592.110000
COAL    84.504083
URANIUM 18.277626
WASTE   0.000000

The subtable names differ between my CSV files, but there are few enough of them that I could enter them manually to detect the tables if required.

Another issue is that many titles include spaces (e.g. "Storage technology"), which pandas reads as two columns.

I initially tried to do it directly with pandas and split the result manually, but the argument on_bad_lines='skip', which avoids the parsing errors, also skips useful lines:

import pandas as pd
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv", on_bad_lines='skip', delim_whitespace=True).dropna(axis=1, how='all')
colnames = ['Technology', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames

I believe it might be better to scan the file as plain text and split it, but I'm unsure of the best way to do this.

I have also tried the solution provided in this thread:

import csv
import pandas as pd
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Technology", "Storage technology", "Resource"]

df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}

but it doesn't work:

tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
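The single key above contains literal `\t` characters, which suggests the file is tab-delimited: with the default comma separator, `read_csv` puts each whole line into column 0, so only the first header line ever matches a table name. A minimal sketch of the same `cumsum`/`groupby` idea with `sep="\t"` follows. The inline `sample` string is a hypothetical stand-in for the real file and assumes tabs between fields; the per-group header handling is an illustration, not part of the original attempt.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the layout in the question (assumed tab-delimited).
sample = (
    "Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]\n"
    "NUCLEAR\t70.308020\t33.374568\n"
    "HYDRO_DAM\t67.717942\t1.271600\n"
    "\n"
    "Storage technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]\n"
    "HYDRO_STORAGE\t0.000000\t0.000000\n"
    "\n"
    "Resource\tC_op [MCHF/y]\n"
    "ELECTRICITY\t1174.452848\n"
)
table_names = ["Technology", "Storage technology", "Resource"]

# sep="\t" splits on tabs; three fixed column names pad shorter rows with NaN.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=list(range(3)))

# A new group starts at every row whose first cell is a known table name.
groups = df[0].isin(table_names).cumsum()

tables = {}
for _, g in df.groupby(groups):
    header = g.iloc[0].dropna().tolist()       # first row of each group is its header
    body = g.iloc[1:, :len(header)].copy()     # keep only the columns the header names
    body.columns = header
    tables[header[0]] = body.reset_index(drop=True)
```

Under these assumptions, `tables` has one entry per table name, and the cell values are still strings because each column mixes header text with numbers.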

EDIT: Final solution, based on @Rabinzel's answer:

import re
import pandas as pd

def make_df(group, dict_of_dfs):
    # The header is tab-separated (names may contain spaces); data rows are split on any whitespace
    header, data = re.split(r'\t', group[0]), list(map(str.split, group[1:]))
    if len(header) != len(data[0]):  # Column names missing: reuse those of the first table read
        header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
    return dict_of_dfs

def Read_csv_as_df(path, file_name):
    with open(path + file_name) as f:
        dict_of_dfs = {}
        group = []
        for line in f:
            if line != '\n':
                group.append(line.strip())
            else:
                # A blank line closes the current table
                dict_of_dfs = make_df(group, dict_of_dfs)
                group = []
        # The last table is not followed by a blank line
        dict_of_dfs = make_df(group, dict_of_dfs)
    return dict_of_dfs

Rabinzel · Accepted Answer · 2022-10-26T04:22:58.903


I would do it the following way: iterate through the lines, collect each chunk separated by a blank line into a list, and build a dataframe from each list. To handle column names that contain spaces, split the header with re.split only where there are two or more whitespace characters. Store the resulting dataframes in a dictionary whose key is the first element of each table's header.

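import re
import pandas as pd
def make_df(group):
    # Split the header on runs of 2+ whitespace characters so names like "Storage technology" stay intact
    header, data = re.split(r'\s\s+',group[0]), list(map(str.split, group[1:]))
    # dict_of_dfs is the module-level dictionary created below
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)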

with open('your_csv_file.csv') as f:
    dict_of_dfs = {}
    group = []
    for line in f:
        if line!='\n':
            group.append(line.strip())
        else:
            make_df(group)
            group = []
    make_df(group)

for key, value in dict_of_dfs.items():
    print(f"{key=}\ndf:\n{value}\n---------------------")

Output:

key='Technology'
df:
              Technology C_inv [MCHF/y] C_maint [MCHF/y]
0                NUCLEAR      70.308020        33.374568
1     HYDRO_DAM_EXISTING       0.000000       195.051200
2              HYDRO_DAM      67.717942         1.271600
3   HYDRO_RIVER_EXISTING       0.000000       204.820000
4         IND_BOILER_OIL       2.053610         0.532362
5        IND_BOILER_COAL       4.179935         1.081855
6       IND_BOILER_WASTE      11.010126         2.849652
7            DEC_HP_ELEC     554.174644       320.791276
8     DEC_THERMAL_HP_GAS      77.077291        33.717477
9         DEC_BOILER_GAS     105.586089        41.161335
10        DEC_BOILER_OIL      33.514266        25.948450
11           H2_FROM_GAS     145.185290        59.178082
12             PYROLYSIS     132.200818       112.392123
---------------------
key='Storage technology'
df:
  Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0      HYDRO_STORAGE       0.000000         0.000000
---------------------
key='Resource'
df:
      Resource C_op [MCHF/y]
0  ELECTRICITY   1174.452848
1     GASOLINE    702.000000
2       DIESEL     96.390000
3          OIL    267.787558
4           NG   1648.527242
5         WOOD    592.110000
6         COAL     84.504083
7      URANIUM     18.277626
8        WASTE      0.000000
---------------------
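
Because the rows are split with `str.split`, every cell in these dataframes is a string, even though the printed output looks numeric. A minimal sketch of converting the value columns afterwards, assuming the first column holds the labels and all remaining columns are numeric; `to_numeric_columns` is a hypothetical helper name, and the small `df` stands in for one of the parsed tables:

```python
import pandas as pd

# Hypothetical stand-in for one parsed table: every cell is a string.
df = pd.DataFrame(
    [["ELECTRICITY", "1174.452848"], ["GASOLINE", "702.000000"]],
    columns=["Resource", "C_op [MCHF/y]"],
)

def to_numeric_columns(table):
    """Return a copy with every column except the first converted to numbers."""
    out = table.copy()
    value_cols = out.columns[1:]
    # pd.to_numeric raises on cells that cannot be parsed, which surfaces bad rows early.
    out[value_cols] = out[value_cols].apply(pd.to_numeric)
    return out

converted = to_numeric_columns(df)
```

The same helper could be applied to each value of `dict_of_dfs`, for example with a dictionary comprehension.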

edited Oct 26 '22 at 04:22

answered Oct 25 '22 at 17:58

Rabinzel


In function `make_df` you can have `key = header[0]` to make the key names the table names. – DarrylG Oct 25 '22 at 20:43
Thanks, that probably makes more sense. I updated my answer. – Rabinzel Oct 25 '22 at 21:05
Thank you for your quick answer! I ended up implementing a solution, but yours is much cleaner. I tried to run yours, but it raises this error: `ValueError: 1 columns passed, passed data had 3 columns`. Do you know where the issue might be? – OliviaJ Oct 27 '22 at 07:08
I guess your data has lines (I think the headers) where the elements are not separated by two or more spaces, so they don't get split. Pandas expects a list of 3 names (3 columns) but only gets one big string containing the 3 names (1 column). – Rabinzel Oct 27 '22 at 07:22
It is the same file that I posted here, though. How do I fix it? – OliviaJ Oct 29 '22 at 06:40
Could you run my code again and add `print(line)` directly after `for line in f`? Then we can see at which line in the file the code throws the error. – Rabinzel Oct 29 '22 at 06:55
The last line printed is `... PYROLYSIS 132.200818 112.392123`, and right after it comes `AssertionError: 1 columns passed, passed data had 2 columns`. So the error pops up at the second table. – OliviaJ Oct 30 '22 at 10:24
OK. I took another look at how you provided the example data. I saved it as a text file, and there I have 2 spaces between "Storage technology" and "C_inv [MCHF/y]", but maybe in your data the names are separated by `\t`. Try running my code with a small change from `re.split(r'\s\s+',group[0])` to `re.split(r'\t',group[0])`. Maybe that helps. – Rabinzel Oct 30 '22 at 10:38
It works! I also had to change `make_df(group)` to `dict_of_dfs = make_df(group,dict_of_dfs)` to make it work as a function. Thanks a lot! – OliviaJ Oct 30 '22 at 11:17
@OliviaJ nice, happy to hear! You can thank me by accepting the answer :) – Rabinzel Oct 30 '22 at 12:22

Done! Sorry, I'm new to StackOverflow. I also added my latest version as an edit to my post, with a slight twist for tables whose header does not repeat the column names. – OliviaJ Oct 30 '22 at 12:37
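
The comments above turn on whether the header fields are separated by tabs or by two or more spaces. A minimal sketch of a header splitter that tolerates either case, so the same code could handle both the pasted example and a tab-delimited file; `split_header` is a hypothetical helper name, not part of the accepted answer:

```python
import re

def split_header(line):
    """Split a header line on tabs, or on runs of two or more whitespace characters.

    A single space is never a separator, so names like "Storage technology" stay intact.
    """
    return re.split(r"\s*\t\s*|\s{2,}", line.strip())

tab_header = "Storage technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]"
space_header = "Storage technology  C_inv [MCHF/y]  C_maint [MCHF/y]"
```

Both `split_header(tab_header)` and `split_header(space_header)` return the same three column names under this pattern.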
