Pandas: Conditional Aggregation

Question

I am trying to develop the following filter with a pandas dataframe:

I have four columns, A, B, A_prime and B_prime;
if A or B is less than a threshold, C, then I would like to find the sum between A_prime and B_prime and allocate it to the maximum of A_prime and B_prime, while setting the minimum of A_prime and B_prime to zero.

How would I write this as a pandas aggregation function?

A working example, written inefficiently, is shown below:

import pandas as pd
import numpy as np

data = {
    "A":list(np.abs(np.random.randn(10))),
    "B":list(np.abs(np.random.randn(10))),
    "A_prime":list(np.abs(np.random.randn(10))),
    "B_prime":list(np.abs(np.random.randn(10)))
    
}

df = pd.DataFrame.from_dict(data)
C = 0.2

print("BEFORE:")
print(df)


for index, row in df.iterrows():
    if(row["A"] < C or row["B"] < C):
        max_idx = np.argmax([row["A"], row["B"]])
        if(max_idx==0):
            row["A_prime"] = row["A_prime"] + row["B_prime"]
            row["B_prime"] = 0
        else:
            row["B_prime"] = row["A_prime"] + row["B_prime"]
            row["A_prime"] = 0
    
print("")
print("AFTER:")
print(df)

Output:

BEFORE:
          A         B   A_prime   B_prime
0  0.182445  0.924890  1.563398  0.562325
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  0.860673  0.372468
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.362513  1.386124
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.601451  0.837827

AFTER:
          A         B   A_prime   B_prime
0  0.182445  0.924890  0.000000  2.125723
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  1.233141  0.000000
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.000000  1.748638
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.000000  1.439278

Should't the first line in your example allocate the max to A_prime instead of B_prime (as it is the greater number)? — nocibambi, Jun 07 '21 at 09:53
@nocibambi From their code, I believe check is over `A` and `B` rather than primes to determine which `*_prime` to put the sum. — Mustafa Aydın, Jun 07 '21 at 09:59
@MustafaAydın Yeah, I read the description, not the code. :) I suppose we wait and find out what the intent was. — nocibambi, Jun 07 '21 at 10:15

score 2 · Accepted Answer · answered Jun 07 '21 at 09:58

here is one way:

prime_cols = ["A_prime", "B_prime"]

# get the candidate sums
prime_sums = df[prime_cols].sum(axis=1)

# check which rows satisfy the `C` threshold
threshold_satisfied = df.A.lt(C) | df.B.lt(C)

# set the satisfying rows' values to sums for both columns
df.loc[threshold_satisfied, prime_cols] = prime_sums

# generate a 1-0 mask that will multiply the greater value by 1 and
# smaller value by 0 to "select" one of them and kill other
mask_A_side = df.A.gt(df.B)
the_mask = pd.concat([mask_A_side, ~mask_A_side], axis=1).set_axis(prime_cols, axis=1)

# multiply with the mask
df.loc[threshold_satisfied, prime_cols] *= the_mask

which first puts the sum of prime columns to both of the columns where threshold condition is satisfied, and then kills one of them with a 1-0 mask multiplication.

to get

>>> df

          A         B   A_prime   B_prime
0  0.182445  0.924890  0.000000  2.125723
1  0.252587  0.273637  0.515395  0.538876
2  1.369412  1.985702  1.813962  1.643794
3  0.834666  0.143880  1.233141  0.000000
4  1.380012  0.715774  0.022681  0.892717
5  0.582497  0.477100  0.956821  1.134613
6  0.083045  0.322060  0.000000  1.748637
7  1.384267  0.251577  0.639843  0.458650
8  0.375456  0.412320  0.661661  0.086588
9  0.079226  0.385621  0.000000  1.439278

nocibambi · Answer 2 · 2021-06-07T10:23:12.687

1

You can rewrite values based on a conditional map and apply.

# Columns to check the min/max on:
check_max_cols = ["A_prime", "B_prime"]


def allocate_sum(row):
    # Identify max and min values
    max_col = row[check_max_cols].idxmax(axis=1)
    min_col = check_max_cols[1] if max_col == check_max_cols[0] else check_max_cols[0]

    row[max_col] = row[["A_prime", "B_prime"]].sum()
    row[min_col] = 0
    return row


below_threshold = (df[["A", "B"]] < C).any(axis=1)

df.loc[below_threshold, :] = df.loc[below_threshold, :].apply(allocate_sum, axis=1)

edited Jun 07 '21 at 10:23

answered Jun 07 '21 at 09:55

nocibambi

2,065
1
16
22

This is what I was looking for. Thank you! – Jack Rolph Jun 07 '21 at 10:49

score 1 · Answer 3 · answered Jun 07 '21 at 10:05

The most pandified way of doing this would be with an apply(), as the example below:

import pandas as pd
import numpy as np

data = {
    "A":list(np.abs(np.random.randn(10))),
    "B":list(np.abs(np.random.randn(10))),
    "A_prime":list(np.abs(np.random.randn(10))),
    "B_prime":list(np.abs(np.random.randn(10)))
    
}

df = pd.DataFrame.from_dict(data)
C = 0.2

def A_B_prime(row):
    A_prime_val = row["A_prime"]
    B_prime_val = row["B_prime"]
    if(row["A"] < C or row["B"] < C):
        max_idx = np.argmax([row["A"], row["B"]])
        if(max_idx==0):
            A_prime_val = row["A_prime"] + row["B_prime"]
            B_prime_val = 0
        else:
            B_prime_val = row["A_prime"] + row["B_prime"]
            A_prime_val = 0
    return A_prime_val, B_prime_val

df['A_prime'], df['B_prime'] = zip(*df.apply(A_B_prime, axis=1))

You can find some good insights on how to return multiple columns from a single apply() on this thread.

score 0 · Answer 4 · answered Jun 07 '21 at 10:42

I don't know if this can be done with aggregation. I am sharing something that has worked for me. See if it works for you too.

Firstly, we can generate the masks

#Get mask of all the rows where A_Prime > B_Prime and either of A or B less than C(0.2)
mask_val_a = ((dataset['A'] < C) | (dataset['B'] < C)) & \
             (dataset['A_Prime'] > dataset['B_Prime']) 

#Get mask of all the rows where A_Prime < B_Prime and either of A or B less than C(0.2)
mask_val_b = ((dataset['A'] < C) | (dataset['B'] < C)) & (dataset['A_Prime'] < dataset['B_Prime'])

Once we have masks values, we can just use pandas mask to update the values

#Firstly we will update the values where A_Prime > B_Prime
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_a, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_a, 0)

#Then we will update the values where B_Prime > A_Prime
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_b, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_b, 0)

Pandas: Conditional Aggregation

4 Answers4