0

I am trying to convert a list of 2d-dataframes into one large dataframe. Lets assume I have the following example, where I create a set of dataframes, each one having the same columns / index:

import pandas as pd
import numpy as np

frames = []

names = []

frame_columns = ['DataPoint1', 'DataPoint2']

for i in range(5):
    names.append("DataSet{0}".format(i))

    frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))

I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].

This dataset would have to have a multi-index consisting of the product of ['DataPoint1', 'DataPoint2'] and the index of the individual dataframes (which is of course the same for all individual frames).

Conversely, the columns would be given as the product of ['Dataset0', ...] and ['DataPoint1', 'DataPoint2'].

In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:

mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T

However, I would like to have the contents of the dataframes present rather than having to then add them.

Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated. Similarly, this thread suggests a join, which is not really what I need.

hfhc2
  • 4,182
  • 2
  • 27
  • 56

1 Answers1

3

You can use concat with keys:

total_frame = pd.concat(frames, keys=names)

Output:

            DataPoint1  DataPoint2
DataSet0 0   -0.656758    1.776027
         1   -0.940759    1.355495
         2    0.173670    0.274525
DataSet1 0   -0.744456   -1.057482
         1    0.186901    0.806281
         2    0.148567   -1.065477
DataSet2 0   -0.980312   -0.487479
         1    2.117227   -0.511628
         2    0.093718   -0.514379
DataSet3 0    0.046963   -0.563041
         1   -0.663800   -1.130751
         2   -1.446891    0.879479
DataSet4 0    1.586213    1.552048
         1    0.196841    1.933362
         2   -0.545256    0.387289

Then you can extract Dataset0 by:

total_frame.loc['DataSet0']

If you really want to use MultiIndex columns instead, you can add axis=1 to concat:

total_frame = pd.concat(frames, axis=1, keys=names)
halfer
  • 19,824
  • 17
  • 99
  • 186
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Well, unfortunately, `pd.concat(frames, keys=names)["DataSet0"]` does not work, since "DataSet0" is not a column – hfhc2 May 20 '20 at 16:00
  • Thank you very much for the help :) Is there any (dis)-advantage to using MultiIndex columns? – hfhc2 May 20 '20 at 16:10
  • advantage is you don't need `loc` to access each. disavantage is It's a wide dataframe which is sometimes discouraged. TBH, I don't see much difference in this case. – Quang Hoang May 20 '20 at 16:13