
I have this data frame, df, that has boolean values:

    A  B  C
0   0  1  0
1   0  1  1
2   0  1  1
3   1  0  1
4   0  0  0
5   1  0  0
6   0  0  0
7   0  0  1
8   1  0  0
9   0  0  0
10  1  0  1
11  1  0  1
12  0  1  1
13  1  0  0
14  1  0  0
15  0  1  0
16  1  1  0
17  0  0  1
18  1  0  1
19  1  0  0
20  1  0  1
21  1  1  0
22  1  1  1
23  1  1  1
24  1  0  0
25  1  1  0
26  0  0  1
27  0  1  1
28  0  1  0
29  1  1  0
30  1  0  1
31  0  1  0
32  0  0  1
33  1  1  1
34  0  1  0
35  1  1  0
36  0  1  0
37  0  0  1
38  0  1  1
39  0  1  1

I stored the count of rows as follows:

    N = len(df.index)  # 40 in this case

Using groupby, I counted the occurrences of each combination of values in df as follows:

    count_series = df.groupby(["A", "B", "C"]).size() #all columns
    new_df = count_series.to_frame(name='count').reset_index()
    print(new_df)

The new_df looks like this:

   A  B  C  count
0  0  0  0     3
1  0  0  1     5
2  0  1  0     6
3  0  1  1     6
4  1  0  0     6
5  1  0  1     6
6  1  1  0     5
7  1  1  1     3

Now, with the row count of df being N = 40, I want to create a new dataframe, dfD, that has the same columns as df plus an additional column named P(A,B,C) holding the probability of each combination. For example, any row with the values 0, 0, 0 should have count/N (3/40), which is 0.075. I found these posts (link1, link2), but none of them helped, because they handle specific cases: my df won't always have just 3 columns (A, B, C) or just 40 rows; it might be bigger than that. I want something that works with a dataframe of any size.

– Cav

3 Answers


Convert each row into a tuple and use groupby:

import pandas as pd

# one representative row per combination, plus each group's count / N
grp = df.apply(tuple, axis=1)
pd.concat([df.groupby(grp).first(),
           grp.groupby(grp).count().div(len(df)).rename("Probs")],
          axis=1).reset_index(drop=True)
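
If you want the probability attached to every original row (as the question asks) rather than one row per combination, the same grouping supports a per-row variant via transform (a minimal sketch; the column name "Probs" is just illustrative):

    # attach each row's group size divided by N as its probability
    df["Probs"] = grp.groupby(grp).transform("size") / len(df)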
– d.b
  • when I do that, all the rows have an equal probability of 0.150, which is incorrect – Cav Mar 30 '22 at 18:22
  • the first approach produces 0.150 for each row. The second approach generates a dataframe similar to new_df with probabilities, which is what I wanted. Thanks a lot – Cav Mar 30 '22 at 18:30

We can use value_counts with the parameter normalize=True to calculate the probabilities, then merge the resulting probabilities with the original dataframe:

c = ['A', 'B', 'C']
df.merge(df[c].value_counts(normalize=True).reset_index(name='prob'))

    A  B  C   prob
0   0  1  0  0.150
1   0  1  0  0.150
2   0  1  0  0.150
3   0  1  0  0.150
...
18  0  0  0  0.075
...
38  1  1  1  0.075
39  1  1  1  0.075
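
Since the question asks for something that works with any number of columns, the column list can be taken from the dataframe itself instead of being hard-coded (a minimal sketch; DataFrame.value_counts requires pandas >= 1.1):

    cols = df.columns.tolist()  # whatever columns df happens to have
    df.merge(df[cols].value_counts(normalize=True).reset_index(name='prob'),
             on=cols)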
– Shubham Sharma

Continuing your own approach ...

import numpy as np

# subset of columns you want to use
columns = new_df.columns[:-1]
# use np.add.reduce to concatenate with astype(str)
new_df['key'] = np.add.reduce(new_df[columns].astype(str), axis=1)
new_df['prob'] = new_df['count'] / N

# do the same in df
df['key'] = np.add.reduce(df.astype(str), axis=1)

# maintain a mapping
mapping = new_df[['key', 'prob']].set_index('key').to_dict()['prob']

# map it
df['P(ABC)'] = df['key'].map(mapping)
df.drop('key', axis=1, inplace=True)
df[columns] = df[columns].astype(int)  # or astype(bool)
df

    A   B   C   P(ABC)
0   0   1   0   0.150
1   0   1   1   0.150
2   0   1   1   0.150
3   1   0   1   0.150
4   0   0   0   0.075
5   1   0   0   0.150
6   0   0   0   0.075
7   0   0   1   0.125
8   1   0   0   0.150
9   0   0   0   0.075
10  1   0   1   0.150
11  1   0   1   0.150
12  0   1   1   0.150
13  1   0   0   0.150
14  1   0   0   0.150
15  0   1   0   0.150
16  1   1   0   0.125
17  0   0   1   0.125
18  1   0   1   0.150
19  1   0   0   0.150
20  1   0   1   0.150
21  1   1   0   0.125
22  1   1   1   0.075
23  1   1   1   0.075
24  1   0   0   0.150
25  1   1   0   0.125
26  0   0   1   0.125
27  0   1   1   0.150
28  0   1   0   0.150
29  1   1   0   0.125
30  1   0   1   0.150
31  0   1   0   0.150
32  0   0   1   0.125
33  1   1   1   0.075
34  0   1   0   0.150
35  1   1   0   0.125
36  0   1   0   0.150
37  0   0   1   0.125
38  0   1   1   0.150
39  0   1   1   0.150
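
One caveat worth noting (my addition, not part of the original answer): concatenating values without a separator can make distinct rows collide once values have more than one digit, e.g. (1, 11) and (11, 1) both become "111". Joining with a separator keeps the keys distinct:

    # "1-11" and "11-1" stay distinct
    df['key'] = df.astype(str).apply('-'.join, axis=1)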
– akilat90