50

I have a long table of data (~200 rows by 50 columns) and I need to create a code that can calculate the mean values of every two rows and for each column in the table with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use python3 and I am aware of some similar questions:here, here and here. But none of these helps as I need some elegant code to work with multiple columns and produces an organised data table. By the way my original datatable has been imported using pandas and is defined as a dataframe but could not find an easy way to do this in pandas. Help is much appreciated.

An example of the table (short version) is:

a   b   c   d
2   50  25  26
4   11  38  44
6   33  16  25
8   37  27  25
10  28  48  32
12  47  35  45
14  8   16  7
16  12  16  30
18  22  39  29
20  9   15  47

Expected mean table:

a    b     c     d
3   30.5  31.5  35
7   35    21.5  25
11  37.5  41.5  38.5
15  10    16    18.5
19  15.5  27    38
Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
Gnu
  • 611
  • 1
  • 6
  • 5

6 Answers6

61

You can create an artificial group using df.index//2 (or as @DSM pointed out, using np.arange(len(df))//2 - so that it works for all indices) and then use groupby:

df.groupby(np.arange(len(df))//2).mean()
Out[13]: 
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • 2
    I found something close [here](stackoverflow.com/questions/36810595/calculate-average-of-every-x-rows-in-a-table-and-create-new-table) BUT your answer if very elegant and compact. Thank you so much! Just out of interest what does the first forward slash means in df.index//2? – Gnu Apr 23 '16 at 12:20
  • You are welcome. That is for integer division so that both 2//2 and 3//2 yield 1, 4//2 and 5//2 yield 2... (and put into the same group). – ayhan Apr 23 '16 at 12:21
  • Got it! Thanks a lot! – Gnu Apr 23 '16 at 12:24
  • 3
    It might be better to group on `np.arange(len(df))//2` instead, in case the index isn't simply 0,1,2.. etc. – DSM Apr 23 '16 at 13:45
  • 1
    And a bit faster is `df.groupby(np.arange(len(df.index))//2).mean()` ;) – jezrael Jun 01 '17 at 11:08
33

You can approach this problem using pd.rolling() to create a rolling average and then just grab every second element using iloc

df = df.rolling(2).mean() 
df = df.iloc[::2, :]

Note that the first observation will be missing (i.e. the rolling starts at the top) so make sure to check that your data is sorted how you need it.

seeiespi
  • 3,628
  • 2
  • 35
  • 37
9

NumPythonic way would be to extract the elements as a NumPy array with df.values, then reshape to a 3D array with 2 elements along axis=1 and 4 along axis=2 and perform the average reduction along axis=1 and finally convert back to a dataframe, like so -

pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))

As it turns out, you can introduce NumPy's very efficient tool : np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so -

pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)

Please note that the proposed approaches assume that the number of rows is divisible by 2.

Also as noted by @DSM, to preserve the column names, you need to add columns=df.columns when converting back to Dataframe, i.e. -

pd.DataFrame(...,columns=df.columns)

Sample run -

>>> df
    0   1   2   3
0   2  50  25  26
1   4  11  38  44
2   6  33  16  25
3   8  37  27  25
4  10  28  48  32
5  12  47  35  45
6  14   8  16   7
7  16  12  16  30
8  18  22  39  29
9  20   9  15  47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0

Runtime tests -

In this section, let's test out all the three approaches listed thus far to solve the problem for performance, including @ayhan's solution with groupby.

In [24]: A = np.random.randint(0,9,(200,50))

In [25]: df = pd.DataFrame(A)

In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop

In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop

In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • This one is ~2x faster than groupby. Nice. – ayhan Apr 23 '16 at 12:31
  • 1
    @ayhan Yeah I noticed that, was about to post runtimes, but then final thought was to let it pass :) Yours is a new thing I learned though, pandas not being my thing! – Divakar Apr 23 '16 at 12:32
  • Note that your numpy approaches lose the column names. You could add `columns=df.columns` to correct this. – DSM Apr 23 '16 at 13:36
  • @DSM Yeah, totally missed that! Thanks! Added that as a note. – Divakar Apr 23 '16 at 13:42
5
df.set_index(np.arange(len(df)) // 2).mean(level=0)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
3

In your case, as you want to average the rows, assuming your dataframe name is new

new = new.groupby(np.arange(len(new)) // 2).mean() 

If one wants to do the average for the columns

new = new.groupby(np.arrange(len(new.columns)) // 2, axis=1).mean()
Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
1

I got ValueError: Grouper and axis must be same length when I tried using numpy to create the artificial group. As an alternative, you can use itertools which will generate an iterator of equal length to your Dataframe:

SAMPLE_SIZE = 2
label_series = pd.Series(itertools.chain.from_iterable(itertools.repeat(x, SAMPLE_SIZE) for x in df.index))
sampled_df = df.groupby(label_series).mean()
micmalti
  • 561
  • 3
  • 16