Conditionally Aggregating Pandas DataFrame

Question

I have a DataFrame that looks like:

import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
                   [17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]], 
                   columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])

      A     B     C     D     E     F     G     H
0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0
1   9.0  10.0  11.0  12.0  13.0  14.0  15.0  16.0
2  17.0  18.0  19.0  20.0  21.0  22.0  23.0  24.0

And I have a list of columns:

l = ['A', 'C', 'D', 'E']

For each element of my list, I want the mean of the columns from my list that precede it plus twice the value in its own column. So A depends only on itself, C depends on A and itself, D depends on A, C, and itself, and E depends on A, C, D, and itself. I have accomplished what I need in the following way:

for i, col in enumerate(l):
    other_cols = l[:i]
    df['tmp_' + col] = df[other_cols].mean(axis=1) + 2.0 * df[col]

      A     B     C     D     E     F     G     H  tmp_A  tmp_C  tmp_D      tmp_E
0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0    NaN    7.0   10.0  12.666667
1   9.0  10.0  11.0  12.0  13.0  14.0  15.0  16.0    NaN   31.0   34.0  36.666667
2  17.0  18.0  19.0  20.0  21.0  22.0  23.0  24.0    NaN   55.0   58.0  60.666667

Is there a more Pythonic way to accomplish the same thing without running through the for loop?

Is it the `sum` or `mean`? In your question, it says the sum of columns in your code however it's mean? And also why is tmp_A `NaN`? — Psidom, Jul 19 '16 at 01:27
I would have guessed from your text that `tmp_A` would be twice `df["A"]`, but your code produces NaN. Just to be clear, that's what you want? — DSM, Jul 19 '16 at 01:50
It should be twice and not NaN but I don't know if there is a good way to handle that case beyond an if statement — slaw, Jul 19 '16 at 01:52

score 1 · Answer 1 · answered Jul 19 '16 at 01:58

IIUC, you can use expanding in modern pandas to handle this:

>>> cols = ["A","C","D","E"]
>>> df[cols] * 2 + df[cols].expanding(axis=1).mean().shift(axis=1).fillna(0)

      A     C     D          E
0   2.0   7.0  10.0  12.666667
1  18.0  31.0  34.0  36.666667
2  34.0  55.0  58.0  60.666667

This reproduces your expected new columns (and has A become twice its original value, thanks to the fillna turning the NaNs to 0s).
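Newer pandas releases deprecate the `axis` argument of `expanding` (since 2.1, as far as I can tell), so the one-liner above may warn or fail there. A minimal sketch that avoids it by transposing first, assuming the `df` and `cols` from the question; `running_mean`, `preceding_mean`, and `result` are just illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
     [17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]],
    columns=["A", "B", "C", "D", "E", "F", "G", "H"],
)
cols = ["A", "C", "D", "E"]

# Transpose so the expanding mean runs down the rows, then transpose back.
running_mean = df[cols].T.expanding().mean().T

# Shift right so each column sees the mean of the columns before it;
# A has no predecessors, so its NaN is replaced by 0.
preceding_mean = running_mean.shift(axis=1).fillna(0)

result = df[cols] * 2 + preceding_mean
print(result)
```

The transpose costs a copy, which should be negligible for a handful of columns but may matter on very wide or very long frames.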

We can see where this comes from step by step:

Starting from

>>> df[cols]

      A     C     D     E
0   1.0   3.0   4.0   5.0
1   9.0  11.0  12.0  13.0
2  17.0  19.0  20.0  21.0

>>> df[cols].expanding(axis=1)
Expanding [min_periods=1,center=False,axis=1]

We can do sum first, because it's easier to check visually:

>>> df[cols].expanding(axis=1).sum()

      A     C     D     E
0   1.0   4.0   8.0  12.0
1   9.0  20.0  32.0  36.0
2  17.0  36.0  56.0  60.0

>>> df[cols].expanding(axis=1).mean()

      A     C          D     E
0   1.0   2.0   2.666667   4.0
1   9.0  10.0  10.666667  12.0
2  17.0  18.0  18.666667  20.0

>>> df[cols].expanding(axis=1).mean().shift(axis=1)

    A     C     D          E
0 NaN   1.0   2.0   2.666667
1 NaN   9.0  10.0  10.666667
2 NaN  17.0  18.0  18.666667

>>> df[cols].expanding(axis=1).mean().shift(axis=1).fillna(0)

     A     C     D          E
0  0.0   1.0   2.0   2.666667
1  0.0   9.0  10.0  10.666667
2  0.0  17.0  18.0  18.666667
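
As a cross-check on the running quantities, here is a sketch that uses `cumsum` instead of `expanding`, assuming the same `df` and `cols`. By hand, the running sum of the first row (1, 3, 4, 5) is 1, 4, 8, 13, so the `E` column of the running sum should be 13, 45, 77 and that of the running mean 3.25, 11.25, 19.25. The `E` values printed in the walkthrough above differ, possibly because of the pandas version used at the time; the final result is unaffected because the shift drops the last column. The names `running_sum`, `running_mean`, `vectorized`, and `looped` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
     [17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]],
    columns=["A", "B", "C", "D", "E", "F", "G", "H"],
)
cols = ["A", "C", "D", "E"]

# Running sum across the selected columns.
running_sum = df[cols].cumsum(axis=1)

# Divide by the number of columns seen so far (1, 2, 3, 4) to get the running mean.
running_mean = running_sum / np.arange(1, len(cols) + 1)

# Mean of the preceding columns (0 for A) plus twice the column itself.
vectorized = 2.0 * df[cols] + running_mean.shift(axis=1).fillna(0)

# The loop from the question, with an explicit guard so A becomes twice itself.
looped = pd.DataFrame(index=df.index)
for i, col in enumerate(cols):
    preceding = df[cols[:i]].mean(axis=1) if i else 0.0
    looped[col] = preceding + 2.0 * df[col]

# Compare the two approaches element-wise.
print(np.allclose(vectorized.to_numpy(), looped.to_numpy()))
```

Since `cumsum(axis=1)` works row by row, this should behave the same on a frame with a single row.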

This was an informative and educational answer! The `Expanding` function is new to me and I doubt that I would've understood it from the docs alone without seeing an illustrative example. — slaw, Jul 19 '16 at 11:30
In `df[cols].expanding(axis=1).sum()`, how come the sum of the last column, `E`, is incorrect? Shouldn't it be 13, 45, 77? — slaw, Jul 19 '16 at 13:45
Additionally, I notice that the columns aren't summed if my dataframe only contains one row of values (i.e. `df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]], columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])`) — slaw, Jul 19 '16 at 14:51
I think what I needed was simply cumsum. Expanding is only for neighboring columns — slaw, Jul 21 '16 at 01:27
