
I have a dataframe which is not ordered:

df
     A   B  Moves
0   E1  E2     10
1   E1  E3     20
2   E1  E4     15
3   E2  E1      9
4   E2  E3      8
5   E2  E4      7
6   E3  E1     30
7   E3  E2     32
8   E3  E4     40
9   E4  E1      5
10  E4  E2     20
11  E4  E3      3

For each group in A, I want to return the rows of B (taking the highest Moves first) until their cumulative sum reaches some minimum % of that group's total Moves.

Once the % threshold is reached, I stop taking rows. The procedure must be "greedy": if a row takes the cumulative sum above the desired %, that row is still included.

If the minimum percentage of the total is 50%, then I want to return:

Desired Output

     A   B  Moves
    E1  E3     20
    E1  E4     15
    E2  E1      9
    E2  E3      8
    E3  E4     40
    E3  E2     32
    E4  E2     20
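
For example, in group E4 the total is 28 moves, so the 50% threshold is 14; the largest row, E2 with 20 moves, crosses that threshold on its own, which is why E4 keeps only the single row [E2].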

I would then want to extract the row names for each grouping using df.groupby(...).apply(list), from this question:

A     Most_Moved
E1      [E3, E4] 
E2      [E1, E3]
E3      [E4, E2]
E4          [E2]
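
That last step on its own is straightforward; a minimal sketch, assuming the threshold-filtered rows are already available in a hypothetical DataFrame called selected with the same columns as df:

# `selected` is assumed to hold only the rows that survive the 50% cut, as in the desired output above
most_moved = selected.groupby('A')['B'].apply(list).rename('Most_Moved')
print(most_moved)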

What I've tried:

Using cumsum, as in this question and this question, I can return the cumulative Moves within each group:

df.groupby(by=['A','B']).sum().groupby(level=[0]).cumsum()[::-1]

       Moves
A  B        
E4 E3     28
   E2     25
   E1      5
E3 E4    102
   E2     62
   E1     30
E2 E4     24
   E3     17
   E1      9
E1 E4     45
   E3     30
   E2     10
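
Note that this cumulative sum is taken in the existing row order, not from the largest Moves downwards. Sorting first gives the cumulative sum the threshold check actually needs; a sketch under that assumption (CumMoves is a hypothetical helper column):

ordered = df.sort_values(['A', 'Moves'], ascending=[True, False])
# cumulative Moves within each A group, accumulated from the largest move down
ordered['CumMoves'] = ordered.groupby('A')['Moves'].cumsum()
print(ordered)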

Separately I can return the total moves (sum) for each group:

df.groupby(by="A").sum()

    Moves
A        
E1     45
E2     24
E3    102
E4     28
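
If those group totals are needed next to every row rather than as a separate frame, transform can broadcast them back; a small sketch, using a hypothetical GroupTotal column:

# attach each A group's total Moves to every row of that group
df['GroupTotal'] = df.groupby('A')['Moves'].transform('sum')
print(df[['A', 'B', 'Moves', 'GroupTotal']])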

From this question and this question I can return each row as a percentage of the sum for its group:

df.groupby(by=["A"])["Moves"].apply(lambda x: 100 * x / float(x.sum()))

0     22.222222
1     44.444444
2     33.333333
3     37.500000
4     33.333333
5     29.166667
6     29.411765
7     31.372549
8     39.215686
9     17.857143
10    71.428571
11    10.714286
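
Combining the two previous ideas, those percentages can also be accumulated within each group after sorting, which is the quantity the 50% cut-off would be compared against; a sketch with a hypothetical CumPct column:

ordered = df.sort_values(['A', 'Moves'], ascending=[True, False])
# each row's share of its own group's total, accumulated from the largest move down
ordered['CumPct'] = 100 * ordered['Moves'] / ordered.groupby('A')['Moves'].transform('sum')
ordered['CumPct'] = ordered.groupby('A')['CumPct'].cumsum()
print(ordered)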

What doesn't work

However, if I combine these, the percentages are computed over all of the rows:

df.groupby(by=["A", "B"])["Moves"].agg({"Total_Moves":sum}).sort_values("Total_Moves", ascending=False).apply(lambda x: 100 * x / float(x.sum()))

       Total_Moves
A  B              
E3 E4    20.100503
   E2    16.080402
   E1    15.075377
E1 E3    10.050251
E4 E2    10.050251
E1 E4     7.537688
   E2     5.025126
E2 E1     4.522613
   E3     4.020101
   E4     3.517588
E4 E1     2.512563
   E3     1.507538

This evaluates the percentage across the entire dataframe, not within the individual groups.
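
The fix for that particular expression is to divide by each A group's own total instead of the grand total; a sketch of one way to do it (within_group_pct is just an illustrative name):

total_moves = df.groupby(['A', 'B'])['Moves'].sum().sort_values(ascending=False)
# normalise by the sum of each row's own A group rather than the overall sum
within_group_pct = 100 * total_moves / total_moves.groupby(level=0).transform('sum')
print(within_group_pct)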

I just can't figure out how to piece this together to get my output.

Any help appreciated.

Chuck

1 Answer


You can use groupby.apply with a custom function:

def select(group, pct=50):
    # print(group)
    moves = group['Moves'].sort_values(ascending=False)
    cumsum = moves.cumsum() / moves.sum()
    # print(cumsum)
    # `cumsum` is the cumulative fraction of the group total contributed by the sorted moves
    idx = len(cumsum[cumsum < pct/100]) + 1
    # print(idx)
    # `idx` is the number of moves to keep: all moves whose cumulative share is still below `pct`,
    # plus one more (the row that reaches the threshold), which makes the selection greedy
    idx = moves.index[:idx]
    # print(idx)
    # here, `idx` is the Index of the selected moves
    # print(group.loc[idx])
    return group.loc[idx].set_index(['B'], drop=True)['Moves']
    # return a Series of Moves for the selected rows, with column `B` as the index
df.groupby('A').apply(select)
        Moves
A   B   
E1  E3  20
    E4  15
E2  E1  9
    E3  8
E3  E4  40
    E2  32
E4  E2  20

Edit

I've added some comments to the code. To make it even clearer what it does, I also added (commented-out) print statements for the intermediate variables. If you uncomment them, don't be surprised that the first group gets printed twice; pandas applies the function to the first group an extra time to decide how to assemble the result.

Maarten Fabré