
My question is very similar to this one, with one addition. I have thousands of pandas DataFrames I wish to concatenate. If I use the following command I get a memory error:

concat_df = pd.concat([df1, df2, df3,..])

So I was thinking of using the "bypickle" solution presented here. However, it does not work because my dfs have different columns, and I don't know these columns in advance; they change every time. Consider the following example (and imagine I don't know the columns before running the script):

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'k', 'c'])

I would like to get a final result like the one below, using the most efficient approach possible.

pd.concat([df1, df2])
   a    b  c    k
0  1  2.0  3  NaN
1  4  5.0  6  NaN
2  7  8.0  9  NaN
0  1  NaN  3  2.0
1  4  NaN  6  5.0
2  7  NaN  9  8.0
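Note that pd.concat aligns the mismatched columns on its own and fills the gaps with NaN, so the obstacle is really memory rather than the column mismatch. A side effect, visible above, is that integer columns which gain NaNs come back as float; if that matters, pandas' convert_dtypes() should restore nullable integers afterwards (a small illustrative check, using df1 and df2 from above):

out = pd.concat([df1, df2])
print(out.dtypes)                    # 'b' and 'k' are float64 because of the NaNs
print(out.convert_dtypes().dtypes)   # 'b' and 'k' should come back as nullable Int64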

EDIT - MY SOLUTION SO FAR

import os

def bycsv(dfs):
    # Append every frame to a single CSV on disk, then read the whole file
    # back in one pass. All frames must already share the same columns,
    # in the same order.
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    # del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all

# Build the union of all column names (I don't know them in advance).
cols = []
for df in [df1, df2]:
    cols += list(df)
cols = sorted(set(cols))

# Reindex every frame to the common column set so the CSV rows line up.
dfsok = [df.reindex(columns=cols) for df in [df1, df2]]

concat_df = bycsv(dfsok)

Anything more efficient?
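One alternative in the pickle/HDF direction mentioned in the comments below: once the frames are reindexed to the common columns, they can be appended to a single HDF5 table and read back in one pass. A minimal sketch, assuming PyTables is installed and all columns are numeric (HDF table appends need consistent dtypes, hence the cast to float; string columns would additionally need min_itemsize). The file name 'df_all.h5', the key 'all', and the byhdf helper are just placeholders:

import os

import pandas as pd

def byhdf(dfs, path='df_all.h5', key='all'):
    # Append every (already reindexed) frame to one HDF5 table on disk,
    # then read the whole table back as a single DataFrame.
    with pd.HDFStore(path, mode='w') as store:
        for df in dfs:
            # Table appends require matching dtypes across frames, so cast
            # everything to float (the NaN-filled result is float anyway).
            store.append(key, df.astype('float64'), format='table', index=False)
    df_all = pd.read_hdf(path, key)
    os.remove(path)
    return df_all

concat_df = byhdf(dfsok)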

Forinstance
  • On the thread you mention there is an approach that writes each of the dataframes to the same CSV and then reads that single CSV back. Have you given that a try? – Borja_042 Nov 22 '22 at 08:47
  • Yep, but they also say the CSV method is pretty inefficient and I have thousands of dfs. So I was looking for a more efficient solution (for example using pickle or HDF), which works in the case of changing columns across dfs. – Forinstance Nov 22 '22 at 08:50
  • Why don't you try installing pyarrow and exporting to and reading from the .parquet format? That would save disk space and hopefully run faster. – Borja_042 Nov 22 '22 at 09:54
  • did not know about this solution, thank you :) Will give it a try. Which would be the code in that case? – Forinstance Nov 22 '22 at 12:20
  • I think before using the pandas parquet format you need to install pyarrow (conda install -c conda-forge pyarrow), then write the dfs with df.to_parquet(write_dir, compression='gzip'), and once you have done that just read them back with pd.read_parquet(). – Borja_042 Nov 22 '22 at 12:37
  • Did you give it a try? – Borja_042 Nov 23 '22 at 09:42
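For completeness, a minimal sketch of the parquet route suggested in the comments above, assuming pyarrow is installed. The directory name, the part_*.parquet file names, and the byparquet helper are placeholders; pa.concat_tables with promote=True unifies the differing schemas and fills the missing columns with nulls, so no reindexing is needed here (newer pyarrow versions spell this promote_options='default'):

import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def byparquet(dfs, write_dir='parquet_parts'):
    # Write each frame to its own parquet file, then read the parts back as
    # Arrow tables and concatenate them with schema promotion, so frames
    # with different columns are aligned and the gaps filled with nulls.
    os.makedirs(write_dir, exist_ok=True)
    paths = []
    for i, df in enumerate(dfs):
        path = os.path.join(write_dir, f'part_{i}.parquet')
        df.to_parquet(path, compression='gzip')
        paths.append(path)
    tables = [pq.read_table(p) for p in paths]
    # promote=True fills columns missing from a file with nulls; newer
    # pyarrow versions use promote_options='default' instead.
    return pa.concat_tables(tables, promote=True).to_pandas()

concat_df = byparquet([df1, df2])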
