1

I have a table like this,

Hour represents the hour of the day agent reached the customer successfully.

Cus1 to Cus4 are the top 4 time slots provided to the agent to call the customer.

Cus1_Score to Cus4_Score represents the probability of success for the call during the hours corresponding to Cus1 to Cus4.

I need to get YES or NO values for Match1 and Match2 columns.

Match1 represents this - Check for the highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus1 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with highest score and and check if the Hour number matches to any of the values in the high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.

Match2 represents this - Check for the second highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus2 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with second highest score and and check if the Hour number matches to any of the values in the second high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.

ID   Hour  Cus1  Cus2  Cus3  Cus4  Cus1_Score  Cus2_Score  Cus3_Score  Cus4_Score  Match1  Match2

1     11    8     10    11    14      0.62        0.59        0.59        0.54       NO     YES 

2     13    15    16    18    13      0.57        0.57        0.57        0.57       YES    NO

3     16    09    14    16    12      0.67        0.54        0.48        0.34       NO     NO

4     08    11    08    12    17      0.58        0.55        0.43        0.25       NO     YES

I tried using idxmax() and nlargest(2) functions and I have no luck as I am not very strong in Python. Highly appreciate your help.

Raju
  • 35
  • 4
  • row2 should be a yes, since it matches cus4, which has a second lowest of 0.57 (that's the only value in that row) – sammywemmy Oct 08 '21 at 06:09

2 Answers2

1

Hopefully I got the conditions right; the calculations below rely on Pandas aligning on the index before assigning values; so I create a long dataframe, create conditions off the long dataframe, reduce the size to unique indices and assign the outcome to the original dataframe:

Create a long dataframe to work out the conditions(I'm using pyjanitor pivot_longer to reshape here because of the convenience; you can do this in a couple of steps without pyjanitor):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor as jn
import pandas as pd
res = (df.pivot_longer(['ID', 'Hour'], 
                       names_pattern= [r'.+\d$', r'.+Score$'], 
                       names_to = ['cus', 'score'], 
                       ignore_index = False)
       )

print(res)

   ID  Hour  cus  score
0   1    11    8   0.62
1   2    13   15   0.57
2   3    16    9   0.67
3   4     8   11   0.58
0   1    11   10   0.59
1   2    13   16   0.57
2   3    16   14   0.54
3   4     8    8   0.55
0   1    11   11   0.59
1   2    13   18   0.57
2   3    16   16   0.48
3   4     8   12   0.43
0   1    11   14   0.54
1   2    13   13   0.57
2   3    16   12   0.34
3   4     8   17   0.25

Create conditions for match1:

# get booleans for max and no duplicates
res = res.sort_index()
max1 = res.groupby(level=0).score.transform('max')
# max without duplicates
cond1 = res.score.eq(max1).groupby(level=0).sum().eq(1)
cond1 = cond1 & df.Hour.eq(df.Cus1)
# max with duplicates
cond2 = res.score.eq(max1).groupby(level=0).transform('sum').gt(1) 
cond2 &= res.Hour.eq(res.cus)
cond2 = cond2.groupby(level=0).any()
df['match1'] = np.where(cond1 | cond2, 'YES', 'NO')

match2:


second_largest = (res.sort_values('score', ascending=False)
                     .groupby(level=0, sort = False)
                     .score
                     .transform('nth', 1)
                  )
second_largest = second_largest.sort_index()
# second largest without duplicates
cond_1 = res.score.eq(second_largest).groupby(level=0).sum().eq(1)
cond_1 &= df.Hour.eq(df.Cus2)
# second largest with duplicates
cond_2 = res.score.eq(second_largest).groupby(level=0).transform('sum').gt(1)
cond_2 &= res.Hour.eq(res.cus)
cond_2 = cond_2.groupby(level=0).any()
df['match2'] = np.where(cond_1 | cond_2, 'YES', 'NO')
df

      ID  Hour  Cus1  Cus2  Cus3  Cus4  Cus1_Score  Cus2_Score  Cus3_Score  Cus4_Score match1 match2
0   1    11     8    10    11    14        0.62        0.59        0.59        0.54     NO    YES
1   2    13    15    16    18    13        0.57        0.57        0.57        0.57    YES    YES
2   3    16     9    14    16    12        0.67        0.54        0.48        0.34     NO     NO
3   4     8    11     8    12    17        0.58        0.55        0.43        0.25     NO    YES

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    Thank you, Sammy. There are a couple of conditions am not seeing what I wanted. Probably because my conditions are not clear enough. In the above example, for ID =2, match2 should be NO. Because the highest score already have the value which is in column hour(which is 13) so when Match1 is YES, match2 cannot be YES. I added a condition below and it's working but the key issues is when am testing on large code, I see match-1 is not working for the duplicates. It is automatically assigning "YES' to match1 even it's a NO. – Raju Oct 08 '21 at 16:53
  • 1
    Added a line before Cond2 for match1 - 'df = df[df.groupby('id')['score'].transform('max')== df['score']] ' and it worked. Thanks again for the help Sammy – Raju Oct 08 '21 at 19:18
1

You need to write a function for row operation with all the criterion that you explained. Then you can apply this function to the rows using, .iterrows, for, or .apply method which you can find them here.

Babak Fi Foo
  • 926
  • 7
  • 17