I have this DataFrame, df, that contains boolean values:
A B C
0 0 1 0
1 0 1 1
2 0 1 1
3 1 0 1
4 0 0 0
5 1 0 0
6 0 0 0
7 0 0 1
8 1 0 0
9 0 0 0
10 1 0 1
11 1 0 1
12 0 1 1
13 1 0 0
14 1 0 0
15 0 1 0
16 1 1 0
17 0 0 1
18 1 0 1
19 1 0 0
20 1 0 1
21 1 1 0
22 1 1 1
23 1 1 1
24 1 0 0
25 1 1 0
26 0 0 1
27 0 1 1
28 0 1 0
29 1 1 0
30 1 0 1
31 0 1 0
32 0 0 1
33 1 1 1
34 0 1 0
35 1 1 0
36 0 1 0
37 0 0 1
38 0 1 1
39 0 1 1
I stored the number of rows as follows:
N = len(df.index) # 40 in this case
Using groupby, I counted how many times each combination of values occurs in df:
count_series = df.groupby(["A", "B", "C"]).size()  # group by all columns
new_df = count_series.to_frame(name="count").reset_index()
print(new_df)
new_df looks like this:
A B C count
0 0 0 0 3
1 0 0 1 5
2 0 1 0 6
3 0 1 1 6
4 1 0 0 6
5 1 0 1 6
6 1 1 0 5
7 1 1 1 3
df has N = 40 rows, and I want to create a new DataFrame, dfD, that has the same columns as df plus an additional column named P(A,B,C) holding the probability of each combination. For example, any row with the values 0,0,0 should get count/N = 3/40 = 0.075.
I found these posts, but none of them helped because they are written for specific cases, and my df won't always have just 3 columns (A, B, C) or just 40 rows; it might be bigger than that: link1 link2
I want something that works with a DataFrame of any size.
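To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive answer. It assumes that every column of df takes part in the combination and that the probability column should be named after the column labels. The helper name joint_probabilities is hypothetical, and the rows string simply re-encodes the 40 rows shown above.

```python
import pandas as pd

# The 40 rows from the question, one "ABC" string per row (rows 0..39).
rows = ("010 011 011 101 000 100 000 001 100 000 "
        "101 101 011 100 100 010 110 001 101 100 "
        "101 110 111 111 100 110 001 011 010 110 "
        "101 010 001 111 010 110 010 001 011 011").split()
df = pd.DataFrame([[int(ch) for ch in r] for r in rows], columns=list("ABC"))


def joint_probabilities(frame):
    """Hypothetical helper: one row per observed combination, with count / N."""
    cols = list(frame.columns)  # assumption: use every column, whatever its name
    prob_name = "P(" + ",".join(map(str, cols)) + ")"
    # size() counts the rows in each group; dividing by len(frame) gives count / N.
    probs = (frame.groupby(cols).size() / len(frame)).reset_index(name=prob_name)
    return probs


probs = joint_probabilities(df)
print(probs)

# If every original row should carry its probability, merge the table back.
# A left merge keeps the rows of df in their original order.
dfD = df.merge(probs, on=list(df.columns), how="left")
print(dfD.head())
```

Because the column list is read from the frame itself, the same call should work for any number of columns or rows. Note that combinations that never occur in df do not appear in probs at all, rather than appearing with probability 0.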