
I have 30 csv data files from 30 replicate runs of an experiment I ran. I am using pandas' read_csv() function to read the data into a list of DataFrames. I would like to create a single DataFrame out of this list, containing the average of the 30 DataFrames for each column. Is there a built-in way to accomplish this?

To clarify, I'll expand on the example in the answers below. Say I have two DataFrames:

>>> x
          A         B         C
0 -0.264438 -1.026059 -0.619500
1  0.927272  0.302904 -0.032399
2 -0.264273 -0.386314 -0.217601
3 -0.871858 -0.348382  1.100491
>>> y
          A         B         C
0  1.923135  0.135355 -0.285491
1 -0.208940  0.642432 -0.764902
2  1.477419 -1.659804 -0.431375
3 -1.191664  0.152576  0.935773

What is the merging function I should use to make a 3D array of sorts with the DataFrame? e.g.,

>>> automagic_merge(x, y)
                      A                      B                      C
0 [-0.264438,  1.923135] [-1.026059,  0.135355] [-0.619500, -0.285491]
1 [ 0.927272, -0.208940] [ 0.302904,  0.642432] [-0.032399, -0.764902]
2 [-0.264273,  1.477419] [-0.386314, -1.659804] [-0.217601, -0.431375]
3 [-0.871858, -1.191664] [-0.348382,  0.152576] [ 1.100491,  0.935773]

so I can calculate average, s.e.m., etc. on those lists instead of the entire column.
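The "3D array of sorts" can be modelled directly in NumPy by stacking the frames' values into a `(n_frames, n_rows, n_cols)` array and reducing along the first axis. A minimal sketch; the two frames here are made-up stand-ins for the 30 `read_csv()` results:

```python
import numpy as np
import pandas as pd

# Two small stand-in frames for the 30 read_csv() results
frames = [pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=list("ABC")),
          pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3) * 2, columns=list("ABC"))]

# Stack into a (n_frames, n_rows, n_cols) array -- the "3D array of sorts"
stacked = np.stack([df.to_numpy() for df in frames])

# Reduce along axis 0 to get one DataFrame of per-cell statistics
mean_df = pd.DataFrame(stacked.mean(axis=0), columns=frames[0].columns)
sem_df = pd.DataFrame(stacked.std(axis=0, ddof=1) / np.sqrt(len(frames)),
                      columns=frames[0].columns)
```

This assumes all replicate files share the same shape, row order, and column names, which holds for replicate runs of the same experiment.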

piRSquared
Randy Olson

3 Answers


Check it out:

In [14]: glued = pd.concat([x, y], axis=1, keys=['x', 'y'])

In [15]: glued
Out[15]: 
          x                             y                    
          A         B         C         A         B         C
0 -0.264438 -1.026059 -0.619500  1.923135  0.135355 -0.285491
1  0.927272  0.302904 -0.032399 -0.208940  0.642432 -0.764902
2 -0.264273 -0.386314 -0.217601  1.477419 -1.659804 -0.431375
3 -0.871858 -0.348382  1.100491 -1.191664  0.152576  0.935773

In [16]: glued.swaplevel(0, 1, axis=1).sort_index(axis=1)
Out[16]: 
          A                   B                   C          
          x         y         x         y         x         y
0 -0.264438  1.923135 -1.026059  0.135355 -0.619500 -0.285491
1  0.927272 -0.208940  0.302904  0.642432 -0.032399 -0.764902
2 -0.264273  1.477419 -0.386314 -1.659804 -0.217601 -0.431375
3 -0.871858 -1.191664 -0.348382  0.152576  1.100491  0.935773

In [17]: glued = glued.swaplevel(0, 1, axis=1).sort_index(axis=1)

In [18]: glued
Out[18]: 
          A                   B                   C          
          x         y         x         y         x         y
0 -0.264438  1.923135 -1.026059  0.135355 -0.619500 -0.285491
1  0.927272 -0.208940  0.302904  0.642432 -0.032399 -0.764902
2 -0.264273  1.477419 -0.386314 -1.659804 -0.217601 -0.431375
3 -0.871858 -1.191664 -0.348382  0.152576  1.100491  0.935773

For the record, swapping and sorting the column levels was not necessary; it is just for readability.

Then you can do stuff like:

In [19]: glued.groupby(level=0, axis=1).mean()
Out[19]: 
          A         B         C
0  0.829349 -0.445352 -0.452496
1  0.359166  0.472668 -0.398650
2  0.606573 -1.023059 -0.324488
3 -1.031761 -0.097903  1.018132
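Since the question also asks for the s.e.m., the same grouped structure supports it. Note that newer pandas deprecates `groupby(..., axis=1)`; a version-safe variant is to transpose, group on the column-name level, and transpose back. A sketch with two small made-up frames:

```python
import pandas as pd

x = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["A", "B"])
y = pd.DataFrame([[5.0, 6.0], [7.0, 8.0]], columns=["A", "B"])

glued = pd.concat([x, y], axis=1, keys=["x", "y"])

# Group the transposed frame on the column-name level, then transpose back;
# this sidesteps groupby(axis=1), which newer pandas deprecates
means = glued.T.groupby(level=1).mean().T
sems = glued.T.groupby(level=1).sem().T   # standard error of the mean, ddof=1
```

With 30 replicates, the list of frames goes straight into `pd.concat(frames, axis=1, keys=range(30))` and the same two lines apply.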
Wes McKinney

I figured out one way to do it.

pandas DataFrames can be added together with the DataFrame.add() function: http://pandas.sourceforge.net/generated/pandas.DataFrame.add.html

So I can add the DataFrames together then divide by the number of DataFrames, e.g.:

avgDataFrame = DataFrameList[0]

for i in range(1, len(DataFrameList)):
    avgDataFrame = avgDataFrame.add(DataFrameList[i])

avgDataFrame = avgDataFrame / len(DataFrameList)
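Because DataFrame addition is element-wise and aligns on index and columns, the explicit loop can be collapsed with the built-in `sum()`. A sketch, assuming all frames share the same index and columns:

```python
import pandas as pd

# Stand-ins for the list of replicate DataFrames
DataFrameList = [pd.DataFrame({"A": [1.0, 2.0]}),
                 pd.DataFrame({"A": [3.0, 4.0]})]

# sum() starts from 0, which broadcasts element-wise against the first frame
avgDataFrame = sum(DataFrameList) / len(DataFrameList)
```

Same result as the loop above, without mutating any of the inputs.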
Randy Olson
  • I like this answer because it can be used when averaging multiple and potentially large dataframes (combined with reading the incoming dataframes on the fly in order to save on memory). – rocarvaj Aug 04 '16 at 21:51

Have a look at the pandas.concat() function. When you read in your files, you can use concat to join the resulting DataFrames into one, then just use normal pandas averaging techniques to average them.

To use it, just pass it a list of the DataFrames you want joined together:

>>> x
          A         B         C
0 -0.264438 -1.026059 -0.619500
1  0.927272  0.302904 -0.032399
2 -0.264273 -0.386314 -0.217601
3 -0.871858 -0.348382  1.100491
>>> y
          A         B         C
0  1.923135  0.135355 -0.285491
1 -0.208940  0.642432 -0.764902
2  1.477419 -1.659804 -0.431375
3 -1.191664  0.152576  0.935773
>>> pandas.concat([x, y])
          A         B         C
0 -0.264438 -1.026059 -0.619500
1  0.927272  0.302904 -0.032399
2 -0.264273 -0.386314 -0.217601
3 -0.871858 -0.348382  1.100491
0  1.923135  0.135355 -0.285491
1 -0.208940  0.642432 -0.764902
2  1.477419 -1.659804 -0.431375
3 -1.191664  0.152576  0.935773
BrenBarn
  • Are you familiar enough with it to give a generic example? The tutorial at http://pandas.sourceforge.net/merging.html is rather confusing. – Randy Olson Jun 24 '12 at 01:44
  • Right right, that concats the two DataFrames together. But how would I use that concatenated DataFrame to create the average DataFrame? – Randy Olson Jun 24 '12 at 01:56
  • Perhaps I misunderstood your question. Can you edit your question to be more specific about what you mean by "the average DataFrame"? I thought you meant the average of all values contained in all the DataFrames. – BrenBarn Jun 24 '12 at 02:41
  • That's correct. I'm trying to create a DataFrame (which I call the "average DataFrame") which contains the averages of all the columns in my list of DataFrames. – Randy Olson Jun 24 '12 at 03:10
  • So once you have your new DataFrame `df` containing all the concatenated data, just do `df.mean()` like you would with any other DataFrame. – BrenBarn Jun 24 '12 at 03:14
  • I feel like this is really close, but there's one minor detail missing that I just realized I left out: I need the averages on a per-row basis. So in your example above, I would want the average of x["A"][0] and y["A"][0], x["A"][1] and y["A"][1], etc. – Randy Olson Jun 24 '12 at 03:40
  • If the documentation is confusing I would appreciate some help improving it (or more specific criticism!). – Wes McKinney Jun 24 '12 at 03:46
  • I made an edit to my original question to clarify what I'm still confused about. – Randy Olson Jun 24 '12 at 07:10
  • I'm starting to wonder if MultiIndex is the better solution. Anyone with more experience know if that's the right path to follow? – Randy Olson Jun 24 '12 at 17:30
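The per-row averaging asked about in these comments falls out of this answer's `concat` directly: the concatenated frame keeps the original row labels (0, 1, 2, 3 repeated once per source frame), so grouping on that duplicated index averages matching rows across frames. A sketch with two small made-up frames:

```python
import pandas as pd

x = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 4.0]})
y = pd.DataFrame({"A": [5.0, 7.0], "B": [6.0, 8.0]})

combined = pd.concat([x, y])                  # index becomes 0, 1, 0, 1
row_means = combined.groupby(level=0).mean()  # averages rows with the same label
```

With 30 replicate frames, `pd.concat(DataFrameList).groupby(level=0).mean()` gives the "average DataFrame" in one line, and `.sem()` in place of `.mean()` gives the standard errors.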