Creating new pandas columns based on conditions on existing columns

Question

I have a dataframe as shown below:

import numpy as np
import pandas as pd

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])

    name    count
0   a       1
1   b       1
2   c       0
3   a       1
4   c       1
5   a       0
6   b       1
7   c       1
8   a       0

I am trying to find the ratio of the number of zeros to the total number of zeros and ones for each element in the "name" column. First I aggregated the counts into a zero_one_frequencies table (a crosstab of name against count), then computed the ratios as follows:

for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratios for {j} = {zero_pb}")
    print(f"One ratios for {j} = {one_pb}")
    print("="*30)

And the output looks like:

a
ZERO ratios for a = 0    0.5
dtype: float64
One ratios for a = 0    0.5
dtype: float64
==============================
b
ZERO ratios for b = 1    0.0
dtype: float64
One ratios for b = 1    1.0
dtype: float64
==============================
c
ZERO ratios for c = 2    0.333333
dtype: float64
One ratios for c = 2    0.666667
dtype: float64
==============================

My goal is to add 2 new columns to the dataframe, "name_0" and "name_1", holding the ratio values for each element in the "name" column. I tried the following, but it's not giving the expected results:

for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb}")
    print(f"One Probability for {j} = {one_pb}")
    print("="*30)
    
    condition1 = [ df2['name'].eq(j) & df2['count'].eq(0)]
    condition2 = [ df2['name'].eq(j) & df2['count'].eq(1)]
    choice1 = zero_pb.tolist()
    choice2 = one_pb.tolist()

    print(f'choice1 = {choice1}, choice2 = {choice2}')
    df2["name"+str("_0")] = np.select(condition1, choice1, default=0)
    df2["name"+str("_1")] = np.select(condition2, choice2, default=0)

Both columns end up holding only the values for the name element 'c'. That is to be expected, since each iteration overwrites the whole column and the last computed values win.

Is there a way to use np.select effectively here?

Expected output:

    name    count   name_0      name_1
0   a       1       0.000000    0.500000
1   b       1       0.000000    1.000000
2   c       0       0.333333    0.000000
3   a       1       0.000000    0.500000
4   c       1       0.000000    0.666667
5   a       0       0.500000    0.000000
6   b       1       0.000000    1.000000
7   c       1       0.000000    0.666667
8   a       0       0.500000    0.000000

Please post your expected output based on `df2`. – Mayank Porwal, Oct 22 '20 at 07:04
Hi Mayank: Edited my post for better clarification. – The Owl, Oct 22 '20 at 07:17

score 1 · Accepted Answer · answered Oct 22 '20 at 10:14

I did not have access to the zero_one_frequencies dataframe, so I took the liberty of solving the problem my own way.

import pandas as pd
import numpy as np
col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])

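# Start with float columns so the ratios can be stored without a dtype change
df2["name_0"] = 0.0
df2["name_1"] = 0.0

for name in df2['name'].unique():
  df_name = df2[df2['name'] == name]
  # Share of ones among the rows with this name
  prob_1 = df_name['count'].mean()
  for count in df2['count'].unique():
    # Rows that have both this name and this count value
    mask = (df2['name'] == name) & (df2['count'] == count)
    # count == 1 gives prob_1, count == 0 gives 1 - prob_1
    df2.loc[mask, "name_" + str(count)] = np.abs(((count + 1) % 2) - prob_1)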

Output:

    name    count   name_0      name_1
0   a       1       0.000000    0.500000
1   b       1       0.000000    1.000000
2   c       0       0.333333    0.000000
3   a       1       0.000000    0.500000
4   c       1       0.000000    0.666667
5   a       0       0.500000    0.000000
6   b       1       0.000000    1.000000
7   c       1       0.000000    0.666667
8   a       0       0.500000    0.000000

For understanding np.select I recommend seeing this post.
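As for using np.select directly, here is a minimal sketch. It assumes the per-name share of ones can be broadcast to every row with groupby(...).transform("mean"), which is my own substitution for the zero_one_frequencies lookup in the question. With per-row ratios available, np.select no longer needs a loop over names, so nothing gets overwritten.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "name": ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a'],
    "count": [1, 1, 0, 1, 1, 0, 1, 1, 0],
})

# Share of ones within each name, repeated on every row of that name.
# Because count only holds 0 and 1, the group mean is the ratio of ones.
one_ratio = df2.groupby("name")["count"].transform("mean")
zero_ratio = 1 - one_ratio

# One condition per column: keep the ratio where count matches, else 0.0
df2["name_0"] = np.select([df2["count"].eq(0)], [zero_ratio], default=0.0)
df2["name_1"] = np.select([df2["count"].eq(1)], [one_ratio], default=0.0)

print(df2)
```

With a single condition per column, np.where(df2["count"].eq(0), zero_ratio, 0.0) would do the same job. np.select becomes more useful once there are several conditions feeding one column.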

Thanks for the crisp code @Oddaspa. It looks much cleaner than mine :) Here's the zero_one_frequencies code I used: `zero_one_frequencies = pd.crosstab(df2['name'], df2['count']).reset_index().rename(columns={'index': 'count'}).rename_axis(None, axis='columns')` – The Owl, Oct 23 '20 at 04:37

score 0 · Answer 2 · answered Oct 22 '20 at 09:53

The following code fixed the issue, though I couldn't find a way to get the same result using numpy.select.

df2["name"+str("_0")] = 0.0
df2["name"+str("_1")] = 0.0
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb.tolist()[0]}")
    print(f"One Probability for {j} = {one_pb.tolist()[0]}")
    print("="*30)
    for idx in df2[df2['name']== j ].index:
        print("Index:::", idx)
        if df2['count'].iloc[idx] == 0:
            df2.at[idx, "name"+str("_0")] = zero_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {df2["count"].iloc[idx]}')
            print('printing name_0: ', df2["name"+str("_0")].iloc[idx])
            print("*"*30)
        elif df2['count'].iloc[idx] == 1:
            df2.at[idx, "name"+str("_1")] = one_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {df2["count"].iloc[idx]}')
            print('printing name_1: ', df2["name"+str("_1")].iloc[idx])
            print("*"*30)
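The per-row loop above could probably be replaced by a vectorized lookup. The sketch below is one possible way, not the method used in this answer: it assumes a row-normalized crosstab (normalize="index") can stand in for zero_one_frequencies, and the names ratios and per_name are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    "name": ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a'],
    "count": [1, 1, 0, 1, 1, 0, 1, 1, 0],
})

# Row-normalized crosstab: ratios.loc[name, 0] is the share of zeros
# for that name, ratios.loc[name, 1] the share of ones.
ratios = pd.crosstab(df2["name"], df2["count"], normalize="index")

for value in (0, 1):
    # Look up each row's ratio by its name...
    per_name = df2["name"].map(ratios[value])
    # ...and keep it only on rows whose count equals this value.
    df2[f"name_{value}"] = per_name.where(df2["count"].eq(value), 0.0)

print(df2)
```

This touches each column once instead of once per row, and it does not rely on the index being a default RangeIndex the way the .iloc[idx] lookups do.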
