I have a dataframe as shown below:
import numpy as np
import pandas as pd

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])
  name  count
0    a      1
1    b      1
2    c      0
3    a      1
4    c      1
5    a      0
6    b      1
7    c      1
8    a      0
For each distinct value in the "name" column, I am trying to find the ratio of the number of zeros to the total number of zeros and ones. First, I aggregated the counts as follows:
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratios for {j} = {zero_pb}")
    print(f"One ratios for {j} = {one_pb}")
    print("="*30)
And the output looks like:
a
ZERO ratios for a = 0    0.5
dtype: float64
One ratios for a = 0    0.5
dtype: float64
==============================
b
ZERO ratios for b = 1    0.0
dtype: float64
One ratios for b = 1    1.0
dtype: float64
==============================
c
ZERO ratios for c = 2    0.333333
dtype: float64
One ratios for c = 2    0.666667
dtype: float64
==============================
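The loop above relies on `zero_one_frequencies`, whose construction is not shown. As a minimal sketch, assuming it is a per-name table of counts with integer column labels `0` and `1` (built here with `pd.crosstab`, which is my assumption and not necessarily how it was originally created), the ratios can be reproduced like this. Pulling out a scalar with `.iloc[0]` also avoids the `dtype: float64` noise in the printed output:

```python
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame({'name': col1, 'count': col2})

# Assumption: one row per name, with columns 'name', 0 and 1 holding the counts.
zero_one_frequencies = pd.crosstab(df2['name'], df2['count']).reset_index()

for j in df2['name'].unique():
    row = zero_one_frequencies[zero_one_frequencies['name'] == j]
    # .iloc[0] extracts a scalar instead of a one-element Series.
    zero_ct = row[0].iloc[0]
    full_ct = zero_ct + row[1].iloc[0]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratio for {j} = {zero_pb:.6f}")
    print(f"One ratio for {j} = {one_pb:.6f}")
    print("=" * 30)
```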
My goal is to add two new columns to the dataframe, "name_0" and "name_1", holding the ratio values for each element in the "name" column. I tried the following, but it's not giving the expected results:
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb}")
    print(f"One Probability for {j} = {one_pb}")
    print("="*30)
    condition1 = [ df2['name'].eq(j) & df2['count'].eq(0)]
    condition2 = [ df2['name'].eq(j) & df2['count'].eq(1)]
    choice1 = zero_pb.tolist()
    choice2 = one_pb.tolist()
    print(f'choice1 = {choice1}, choice2 = {choice2}')
    df2["name"+str("_0")] = np.select(condition1, choice1, default=0)
    df2["name"+str("_1")] = np.select(condition2, choice2, default=0)
Both columns end up holding only the values for the name element 'c'. That is to be expected, since the last computed values overwrite everything written in earlier iterations.
Is there a better way to use np.select here?
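One possible way to keep `np.select`, sketched here under my own assumptions, is to call it once per column with one condition per name instead of once per loop iteration. The helper names `one_share`, `conditions_0` and `choices_0` are hypothetical, and the per-name ratio is taken from `groupby(...).mean()` instead of `zero_one_frequencies`, which works only because "count" contains nothing but 0 and 1:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a'],
    'count': [1, 1, 0, 1, 1, 0, 1, 1, 0],
})

names = df2['name'].unique()
# With 0/1 values, the group mean is the share of ones for each name.
one_share = df2.groupby('name')['count'].mean()

# One condition and one choice per name, all passed to np.select together,
# so no name can overwrite the values written for another.
conditions_0 = [df2['name'].eq(n) & df2['count'].eq(0) for n in names]
choices_0 = [1 - one_share[n] for n in names]
conditions_1 = [df2['name'].eq(n) & df2['count'].eq(1) for n in names]
choices_1 = [one_share[n] for n in names]

df2['name_0'] = np.select(conditions_0, choices_0, default=0.0)
df2['name_1'] = np.select(conditions_1, choices_1, default=0.0)
print(df2)
```

Rows that match none of the conditions fall back to `default=0.0`, which is what produces the zeros in the other column.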
Expected output:
  name  count    name_0    name_1
0    a      1  0.000000  0.500000
1    b      1  0.000000  1.000000
2    c      0  0.333333  0.000000
3    a      1  0.000000  0.500000
4    c      1  0.000000  0.666667
5    a      0  0.500000  0.000000
6    b      1  0.000000  1.000000
7    c      1  0.000000  0.666667
8    a      0  0.500000  0.000000
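For comparison, the table above can also be produced without a loop. This is only a sketch of an alternative technique (`groupby(...).transform` plus `np.where` in place of `np.select`), and it again assumes "count" holds only 0 and 1; `one_share` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a'],
    'count': [1, 1, 0, 1, 1, 0, 1, 1, 0],
})

# transform('mean') broadcasts each name's share of ones back onto its rows.
one_share = df2.groupby('name')['count'].transform('mean')

# Fill each column only where the row's own count matches; 0.0 elsewhere.
df2['name_0'] = np.where(df2['count'].eq(0), 1 - one_share, 0.0)
df2['name_1'] = np.where(df2['count'].eq(1), one_share, 0.0)
print(df2)
```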