Calculate average of every x rows in a table and create new table

Question

I have a long table of data (~200 rows by 50 columns) and I need to create a code that can calculate the mean values of every two rows and for each column in the table with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use python3 and I am aware of some similar questions:here, here and here. But none of these helps as I need some elegant code to work with multiple columns and produces an organised data table. By the way my original datatable has been imported using pandas and is defined as a dataframe but could not find an easy way to do this in pandas. Help is much appreciated.

An example of the table (short version) is:

a   b   c   d
2   50  25  26
4   11  38  44
6   33  16  25
8   37  27  25
10  28  48  32
12  47  35  45
14  8   16  7
16  12  16  30
18  22  39  29
20  9   15  47

Expected mean table:

a    b     c     d
3   30.5  31.5  35
7   35    21.5  25
11  37.5  41.5  38.5
15  10    16    18.5
19  15.5  27    38

ayhan · Answer 1 · 2016-04-23T13:54:34.477

61

You can create an artificial group using df.index//2 (or as @DSM pointed out, using np.arange(len(df))//2 - so that it works for all indices) and then use groupby:

df.groupby(np.arange(len(df))//2).mean()
Out[13]: 
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0

edited Apr 23 '16 at 13:54

answered Apr 23 '16 at 12:13

ayhan

70,170
20
182
203

2

I found something close [here](stackoverflow.com/questions/36810595/calculate-average-of-every-x-rows-in-a-table-and-create-new-table) BUT your answer if very elegant and compact. Thank you so much! Just out of interest what does the first forward slash means in df.index//2? – Gnu Apr 23 '16 at 12:20
You are welcome. That is for integer division so that both 2//2 and 3//2 yield 1, 4//2 and 5//2 yield 2... (and put into the same group). – ayhan Apr 23 '16 at 12:21
Got it! Thanks a lot! – Gnu Apr 23 '16 at 12:24
3

It might be better to group on `np.arange(len(df))//2` instead, in case the index isn't simply 0,1,2.. etc. – DSM Apr 23 '16 at 13:45
1

And a bit faster is `df.groupby(np.arange(len(df.index))//2).mean()` ;) – jezrael Jun 01 '17 at 11:08

seeiespi · Answer 2 · 2020-04-04T21:09:17.727

33

You can approach this problem using pd.rolling() to create a rolling average and then just grab every second element using iloc

df = df.rolling(2).mean() 
df = df.iloc[::2, :]

Note that the first observation will be missing (i.e. the rolling starts at the top) so make sure to check that your data is sorted how you need it.

edited Apr 04 '20 at 21:09

answered Feb 27 '18 at 19:19

seeiespi

3,628
2
35
37

6

This is so much more elegant than the previous answers. – Julio Diaz Sep 26 '19 at 15:00

score 9 · Answer 3 · edited May 23 '17 at 11:46

NumPythonic way would be to extract the elements as a NumPy array with df.values, then reshape to a 3D array with 2 elements along axis=1 and 4 along axis=2 and perform the average reduction along axis=1 and finally convert back to a dataframe, like so -

pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))

As it turns out, you can introduce NumPy's very efficient tool : np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so -

pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)

Please note that the proposed approaches assume that the number of rows is divisible by 2.

Also as noted by @DSM, to preserve the column names, you need to add columns=df.columns when converting back to Dataframe, i.e. -

pd.DataFrame(...,columns=df.columns)

Sample run -

>>> df
    0   1   2   3
0   2  50  25  26
1   4  11  38  44
2   6  33  16  25
3   8  37  27  25
4  10  28  48  32
5  12  47  35  45
6  14   8  16   7
7  16  12  16  30
8  18  22  39  29
9  20   9  15  47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0

Runtime tests -

In this section, let's test out all the three approaches listed thus far to solve the problem for performance, including @ayhan's solution with groupby.

In [24]: A = np.random.randint(0,9,(200,50))

In [25]: df = pd.DataFrame(A)

In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop

In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop

In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop

@ayhan Yeah I noticed that, was about to post runtimes, but then final thought was to let it pass :) Yours is a new thing I learned though, pandas not being my thing! — Divakar, Apr 23 '16 at 12:32
Note that your numpy approaches lose the column names. You could add `columns=df.columns` to correct this. — DSM, Apr 23 '16 at 13:36
@DSM Yeah, totally missed that! Thanks! Added that as a note. — Divakar, Apr 23 '16 at 13:42

score 5 · Answer 4 · answered Aug 05 '17 at 20:31

5

df.set_index(np.arange(len(df)) // 2).mean(level=0)

answered Aug 05 '17 at 20:31

piRSquared

285,575
57
475
624

Gonçalo Peres · Answer 5 · 2021-01-07T13:52:06.627

3

In your case, as you want to average the rows, assuming your dataframe name is new

new = new.groupby(np.arange(len(new)) // 2).mean()

If one wants to do the average for the columns

new = new.groupby(np.arrange(len(new.columns)) // 2, axis=1).mean()

edited Jan 07 '21 at 13:52

answered Jan 07 '21 at 13:23

Gonçalo Peres

11,752
3
54
83

score 1 · Answer 6 · answered Mar 13 '21 at 10:04

I got ValueError: Grouper and axis must be same length when I tried using numpy to create the artificial group. As an alternative, you can use itertools which will generate an iterator of equal length to your Dataframe:

SAMPLE_SIZE = 2
label_series = pd.Series(itertools.chain.from_iterable(itertools.repeat(x, SAMPLE_SIZE) for x in df.index))
sampled_df = df.groupby(label_series).mean()

Calculate average of every x rows in a table and create new table

6 Answers6

Linked

Related