I have multiple CSV files which are formatted with multiple tables inside separated by line breaks.
Example:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
NUCLEAR 70.308020 33.374568
HYDRO_DAM_EXISTING 0.000000 195.051200
HYDRO_DAM 67.717942 1.271600
HYDRO_RIVER_EXISTING 0.000000 204.820000
IND_BOILER_OIL 2.053610 0.532362
IND_BOILER_COAL 4.179935 1.081855
IND_BOILER_WASTE 11.010126 2.849652
DEC_HP_ELEC 554.174644 320.791276
DEC_THERMAL_HP_GAS 77.077291 33.717477
DEC_BOILER_GAS 105.586089 41.161335
DEC_BOILER_OIL 33.514266 25.948450
H2_FROM_GAS 145.185290 59.178082
PYROLYSIS 132.200818 112.392123
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
HYDRO_STORAGE 0.000000 0.000000
Resource C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE 702.000000
DIESEL 96.390000
OIL 267.787558
NG 1648.527242
WOOD 592.110000
COAL 84.504083
URANIUM 18.277626
WASTE 0.000000
All my CSV files have different subtable names but few enough that I could enter them manually to detect them if required.
Another issue is that many titles include spaces (eg "Storage Technology") which is read by pandas as 2 columns.
I initially tried to do it directly with pandas and splitting manually but the argument on_bad_lines='skip'
which allows avoiding errors also skips useful lines:
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv",on_bad_lines='skip',delim_whitespace=True).dropna(axis=1,how='all')
colnames=['Technnolgy', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames
I believe it might be better to scan the .txt
file and split it but I'm unsure how to do this in the best way.
I have also tried to use the solution provided in this feed
import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Technology", "Storage technology", "Resource"]
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
but it doesn't work:
tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
EDIT : Final solution based on @Rabinzel:
import re
def make_df(group,dict_of_dfs):
header, data = re.split(r'\t',group[0]), list(map(str.split, group[1:]))
if len(header) != len(data[0]): # If missing columns list, take former
header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
return dict_of_dfs
def Read_csv_as_df(path, file_name):
with open(path+file_name) as f:
dict_of_dfs = {}
group = []
for line in f:
if line!='\n':
group.append(line.strip())
else:
print(dict_of_dfs)
dict_of_dfs = make_df(group,dict_of_dfs)
group = []
dict_of_dfs = make_df(group,dict_of_dfs)
return dict_of_dfs