I have a dataframe which is not ordered:
df
A B Moves
0 E1 E2 10
1 E1 E3 20
2 E1 E4 15
3 E2 E1 9
4 E2 E3 8
5 E2 E4 7
6 E3 E1 30
7 E3 E2 32
8 E3 E4 40
9 E4 E1 5
10 E4 E2 20
11 E4 E3 3
I want to return rows B
until their cumulative sum adds up to some minimum % of the overall total Moves
for each grouping of B
in A
(where I take the highest first).
Once the % threshold is reached, I stop taking rows (a cumulative sum). The procedure must be "greedy" in that if a row takes it above the desired % it includes that row.
If the minimum percentage of the total is 50%, then I want to return first:
Desired Output
A B Moves
E1 E3 20
E1 E4 15
E2 E1 9
E2 E3 8
E3 E4 40
E3 E2 32
E4 E2 20
I would then want to extract the row names for each grouping using df.groupby(...).apply(list)
fromthis question
A Most_Moved
E1 [E3, E4]
E2 [E1, E3]
E3 [E4, E2]
E4 [E2]
What I've tried:
I can return the Total_Moves ordered using cumsum
in this question and this question:
df.groupby(by=['A','B']).sum().groupby(level=[0]).cumsum()[::-1]
Moves
A B
E4 E3 28
E2 25
E1 5
E3 E4 102
E2 62
E1 30
E2 E4 24
E3 17
E1 9
E1 E4 45
E3 30
E2 10
Separately I can return the total moves (sum) for each group:
df.groupby(by="A").sum()
Moves
A
E1 45
E2 24
E3 102
E4 28
From this question and this question I can return each row as a percentage of the sum for that category:
df.groupby(by=["A"])["Moves"].apply(lambda x: 100 * x / float(x.sum()))
0 22.222222
1 44.444444
2 33.333333
3 37.500000
4 33.333333
5 29.166667
6 29.411765
7 31.372549
8 39.215686
9 17.857143
10 71.428571
11 10.714286
What doesn't work
However, if I combine these, it evaluates the percentage of the overall rows:
df.groupby(by=["A", "B"])["Moves"].agg({"Total_Moves":sum}).sort_values("Total_Moves", ascending=False).apply(lambda x: 100 * x / float(x.sum()))
Total_Moves
A B
E3 E4 20.100503
E2 16.080402
E1 15.075377
E1 E3 10.050251
E4 E2 10.050251
E1 E4 7.537688
E2 5.025126
E2 E1 4.522613
E3 4.020101
E4 3.517588
E4 E1 2.512563
E3 1.507538
This evaluates the percentage across the entire dataframe, not within the individual groups.
I just can't figure out how to piece this together to get my output.
Any help appreciated.