3

I'm trying to create a flag variable (i.e. a new column with binary values, like 1 for True, 0 for False) - I've tried both np.where (as per this post) and df.where to no avail.

With df.where using:

df.where(((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Developing')) | ((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Highly Effective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Effective')) | ((df['MOSL_Rating'] == 'Developing') & (df['MOTP_Rating'] == 'Highly Effective')), df['disp_rating'], 1, axis=1)

but this returns ValueError: For argument "inplace" expected type bool, received type int.

If I change my code from df['disp_rating'], 1, axis=1 to df['disp_rating'], True, axis=1 it returns TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

I've also tried np.where but that returns ValueError: either both or neither of x and y should be given

I've also read this question, which looks similar. However when I use the solution presented there it returns: KeyError: 'disp_rating'

If I create the variable in advance (to avoid the Key Error) I just get another error about something else.

I assumed it would be pretty simple to create a new variable based on some basic conditions, but I've been stuck on this for a while and am not really making any progress despite reading the documentation and lots of SO posts.

edit: Just to be extra clear, I'm trying to create a new column (named 'disp_rating') based on whether or not the values in 2 other columns ('MOSL_Rating and 'MOTP_Rating') within the same df meet certain conditions. I only have 1 dataframe, so I'm not trying to compare 2 dataframes. In SQL I would use a CASE WHEN statement, in SAS I would use an IF/THEN/ELSE statement.

My df generally looks like this:

ID  Loc  MOSL_rating MOTP_Rating
12  54X  D           E   
45  86I  D           I    
98  65R  H           H  
LMGagne
  • 1,636
  • 6
  • 24
  • 47
  • could you give a snippet of code to create a df with the correct columns ? – Romain Jouin Mar 13 '18 at 21:03
  • Are you trying to compare two dataframes? Are you creating a new column? This question isn't clear. – Nanda Mar 13 '18 at 21:07
  • `df[df['MOSL_Rating'] == 'Highly Effective' & df['MOTP_Rating'] == 'Developing' ]` -- if you want to select rows based on your condition – Nanda Mar 13 '18 at 21:09
  • What do you try to do ? you should have to distincts dataframe – Romain Jouin Mar 13 '18 at 21:56
  • Updated post to explain further - I'm trying to create a new column based on whether or not the values in 2 other columns within the same df meet certain conditions. I only have 1 dataframe, so I'm not trying to compare 2 dataframes. – LMGagne Mar 14 '18 at 13:31
  • @LMGagne, did one of the below solutions help? if so, feel free to accept one, or ask for clarification. – jpp Aug 29 '19 at 09:31

2 Answers2

1

Your logic is overcomplicated and can be simplified / optimized via set. Below is a demonstration.

d = {frozenset({'H', 'D'}),
     frozenset({'H', 'I'}),
     frozenset({'E', 'I'})}

df['MOSL_MOTP'] = list(map(frozenset, zip(df['MOSL_Rating'], df['MOTP_Rating'])))
df['Result'] = np.where(df['MOSL_MOTP'].isin(d), 1, 0)

#    ID  Loc MOSL_Rating MOTP_Rating MOSL_MOTP  Result
# 0  12  54X           D           E    (E, D)       0
# 1  45  86I           D           I    (D, I)       0
# 2  98  65R           H           H       (H)       0
# 3  95  66R           H           D    (D, H)       1
# 4  96  67R           D           H    (D, H)       1
# 5  97  68R           E           I    (E, I)       1
jpp
  • 159,742
  • 34
  • 281
  • 339
0

I couldn't find why where doesn't work, but here is a way around :

Creating a code to create your df :

def make_row():
    import random
    dico = {"MOSL_Rating" : ['Highly Effective', 'Effective', 'Ineffective', 'Developing'],
            "MOTP_Rating" : ['Developing', 'Ineffective', 'Highly Effective', 'Effective', 'Highly Effective'],
           "disp_rating" : range(100)}

    row = {}
    for k in dico.keys():
        v = random.choice(dico[k])
        row[k] =v
    return row

def make_df(nb_row):
    import pandas as pd
    rows = [make_row() for i in range(nb_row)]
    return pd.DataFrame(rows)

I can create a df :

df = make_df(3)

    MOSL_Rating MOTP_Rating disp_rating
0   Highly Effective    Ineffective 39
1   Highly Effective    Highly Effective    71
2   Effective   Ineffective 95

and a seconde one :

df2 = make_df(3)
df2
    MOSL_Rating MOTP_Rating disp_rating
0   Effective   Highly Effective    24
1   Effective   Developing  38
2   Highly Effective    Ineffective 16

I then create your tests :

MOSL_high_efective   = df['MOSL_Rating'] == 'Highly Effective'
MOSL_efective        = df['MOSL_Rating'] == 'Effective'
MOSL_inefective      = df['MOSL_Rating'] == 'Ineffective'
MOSL_developing      = df['MOSL_Rating'] == 'Developing'

MOTP_high_efective   = df['MOTP_Rating'] == 'Highly Effective'
MOTP_efective        = df['MOTP_Rating'] == 'Effective'
MOTP_inefective      = df['MOTP_Rating'] == 'Ineffective'
MOTP_developing      = df['MOTP_Rating'] == 'Developing'

test1 = MOSL_high_efective & MOTP_developing
test2 = MOSL_high_efective & MOTP_inefective
test3 = MOSL_efective      & MOTP_inefective
test4 = MOSL_inefective    & MOTP_high_efective
test5 = MOSL_inefective    & MOTP_efective
test6 = MOSL_developing    & MOTP_high_efective

conditions  = test1 | test2 |  test3 | test4 | test5 | test6

and then update the values of first dataframe by the second one where the conditions are meet :

    lines_to_be_updates = df.loc[conditions].index.values
    df.loc[lines_to_be_updates, "disp_rating"] = df2[lines_to_be_updates]["disp_rating"]

df
    MOSL_Rating MOTP_Rating disp_rating
0   Highly Effective    Ineffective 24
1   Highly Effective    Highly Effective    71
2   Effective   Ineffective 16
Romain Jouin
  • 4,448
  • 3
  • 49
  • 79
  • I don't understand how the values in the disp_rating column relate to the values of the other columns. The 0-100 range is confusing. And why are you making 2 dataframes? – LMGagne Mar 14 '18 at 13:37