My question is very similar to this one, with one addition. I have thousands of Pandas data frames that I wish to concatenate. If I use the following command, I get a memory error:
concat_df = pd.concat([df1, df2, df3, ...])
So I was thinking of using the "bypickle" solution presented here. However, it does not work, because my data frames have different columns and I don't know these columns in advance; they change every time. Consider the following example (and imagine I don't know the columns before running the script):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'k', 'c'])
I would like to get a final result like the one below, produced in the most efficient way possible.
pd.concat([df1,df2])
   a    b  c    k
0  1  2.0  3  NaN
1  4  5.0  6  NaN
2  7  8.0  9  NaN
0  1  NaN  3  2.0
1  4  NaN  6  5.0
2  7  NaN  9  8.0
EDIT - MY SOLUTION SO FAR
import os
import pandas as pd

def bycsv(dfs):
    # Stream every frame into one CSV on disk, then read it back in a single pass.
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=False)
        md, hd = 'a', False
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all

# Collect the union of all column names (they are unknown until runtime).
cols = []
for df in [df1, df2]:
    cols += list(df)
cols = sorted(set(cols))

# Reindex every frame to the shared column set so the CSV rows line up.
dfsok = [df.reindex(columns=cols) for df in [df1, df2]]
bycsv(dfsok)
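One variation I'm also experimenting with (a minimal sketch; bycsv_lazy is just my working name, and I'm assuming two passes over the list of frames are acceptable) reindexes each frame inside the writer loop, so the full reindexed list dfsok never has to exist in memory alongside the originals:

import os
import pandas as pd

def bycsv_lazy(dfs, path='df_all.csv'):
    # First pass: collect the union of all column names.
    cols = sorted({c for df in dfs for c in df.columns})

    # Second pass: reindex each frame on the fly and append it to the CSV,
    # so only one reindexed copy exists at any given time.
    md, hd = 'w', True
    for df in dfs:
        df.reindex(columns=cols).to_csv(path, mode=md, header=hd, index=False)
        md, hd = 'a', False

    df_all = pd.read_csv(path, index_col=None)
    os.remove(path)
    return df_all

This trades an extra pass over the frames for never holding a second copy of every frame at once, but I still pay the cost of the final read_csv.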
Anything more efficient?