3

I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key), and I would like to assign a score to each row based on the values in columns b and c. I have created the following functions:

def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else: 
        return 1

def c_score_function(df): 
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else: 
        return 1

Normally, I would use something like this:

df['b_score'] = df.apply(b_score_function, axis = 1)
df['c_score'] = df.apply(c_score_function, axis = 1)

However, the above approach becomes too long if I have many columns. How can I loop over the selected columns instead? I have tried the following:

ds_cols = df.columns.difference(['a_id']).to_list() 

for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis = 1)

but it raised the following error:

'b_score_function' is not a valid function for 'DataFrame' object

Can anyone please point out what I did wrong? Also, if anyone can suggest how to create a reusable solution, that would be appreciated.

Thank you.

Deysiz
  • How many different column-specific functions do you have in your real data? Is it just two (b/c) or more? – Tom Jul 23 '22 at 04:00
  • Have you tried something like this `df['b_score'] = np.where(df.b<=0, 0,np.where(df.b<=2,0.25, 1))` and for the other one `df['c_score'] = np.where(df.c<=0, 0,np.where(df.c<=1,0.5, 1))` – XXavier Jul 23 '22 at 04:10
  • The problem with using np.where is I need to always rewrite the case. I would like to be able to reuse the function and would prefer to update in one go. – Deysiz Jul 23 '22 at 04:30

3 Answers

1

IIUC, this should work for you:

import pandas as pd

df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else: 
        return 1

def c_score_function(df): 
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else: 
        return 1


ds_cols = df.columns.difference(['a_id']).to_list() 
for col in ds_cols:
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis = 1)
print(df)

Result:

   a_id     b     c  b_score  c_score
0     0  0.00  0.00     0.00      0.0
1     1  0.25  0.25     0.25      0.5
2     2  0.50  0.50     0.25      0.5
3     3  2.00  1.00     0.25      0.5
4     4  2.50  1.50     1.00      1.0
René
  • Thanks! I am wondering if there is a way to iterate the score_funs because I have many columns and functions. I looked around but I couldn't seem to find the answer (especially because of the key) – Deysiz Jul 23 '22 at 05:59
  • I updated my answer and hope this works for you. – René Jul 23 '22 at 06:34
0

The problem with your attempt is that pandas cannot look up your own functions from a string containing their name. You need to pass the function object itself, df.apply(b_score_function, axis=1), and not its name as a string, df.apply("b_score_function", axis=1) (note the double quotes).

My first thought would be to link the column names to functions with a dictionary:

funcs = {'b' : b_score_function,
         'c' : c_score_function}

for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis = 1)

Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
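One possible way to automate the creation of such functions is a small factory that builds each scorer from its parameters. This is only a sketch: make_score_function and the params dictionary are hypothetical names, and it assumes every column follows the same three-level pattern as in the question (0, a middle score, 1).

```python
import pandas as pd

def make_score_function(col, upper, mid_score):
    """Build a row-wise scorer for `col` (hypothetical helper).

    Returns 0 if the value is <= 0, mid_score if it is <= upper, else 1.
    """
    def score(row):
        if row[col] <= 0:
            return 0
        elif row[col] <= upper:
            return mid_score
        return 1
    return score

# Assumed parameters per column: (upper threshold, middle score),
# copied from the thresholds used in the question.
params = {'b': (2, 0.25), 'c': (1, 0.5)}
funcs = {col: make_score_function(col, upper, mid)
         for col, (upper, mid) in params.items()}

df = pd.DataFrame({'a_id': range(5),
                   'b': [0.0, 0.25, 0.5, 2.0, 2.5],
                   'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

# One loop handles every column that has an entry in funcs.
for col, func in funcs.items():
    df[f'{col}_score'] = df.apply(func, axis=1)
```

With this layout, adding a column means adding one entry to params rather than writing another function.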

One somewhat automatic way is to use locals() or globals() - these will return dictionaries which have the functions you defined (as well as other things):

for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]
    df[f'{col}_score'] = df.apply(foo, axis=1)

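This code depends on the function for column "X" being called X_score_function(), but that seems to be met in your example. It also requires that every column in ds_cols has a corresponding entry in locals(); otherwise the lookup raises a KeyError. Note that if the loop runs inside a function, locals() will not contain functions defined at module level, so globals() is the one to use there.
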
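If relying on a naming convention and a locals() lookup feels fragile, a different technique is an explicit registry filled by a decorator. The sketch below is an alternative, not what the code above does; SCORERS and scorer are hypothetical names introduced for illustration.

```python
import pandas as pd

SCORERS = {}  # hypothetical registry: column name -> scoring function

def scorer(col):
    """Decorator that registers a function as the scorer for `col`."""
    def register(func):
        SCORERS[col] = func
        return func
    return register

@scorer('b')
def b_score_function(row):
    if row['b'] <= 0:
        return 0
    elif row['b'] <= 2:
        return 0.25
    return 1

@scorer('c')
def c_score_function(row):
    if row['c'] <= 0:
        return 0
    elif row['c'] <= 1:
        return 0.5
    return 1

df = pd.DataFrame({'a_id': range(5),
                   'b': [0.0, 0.25, 0.5, 2.0, 2.5],
                   'c': [0.0, 0.25, 0.5, 1.0, 1.5]})
ds_cols = df.columns.difference(['a_id']).to_list()

# Fail early with a clear message if a column has no registered scorer.
missing = [col for col in ds_cols if col not in SCORERS]
if missing:
    raise KeyError(f'no scoring function registered for: {missing}')

for col in ds_cols:
    df[f'{col}_score'] = df.apply(SCORERS[col], axis=1)
```

The registry works regardless of how the functions are named or where the loop runs, at the cost of one decorator line per function.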
Somewhat confusingly, there are some functions which you can access by passing a string to apply, but only names that pandas can resolve on its own, such as DataFrame methods or NumPy functions, like df.apply('sum') or df.apply('mean'). This behavior does not appear to be prominently documented. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by the string is convenient.
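The difference between a string pandas can resolve and the name of a user-defined function can be checked directly. A minimal sketch with made-up data; the exception type is what recent pandas versions raise, as far as I can tell, and the message is the one quoted in the question.

```python
import pandas as pd

df = pd.DataFrame({'b': [0.0, 2.0, 2.5], 'c': [0.0, 1.0, 1.5]})

# 'sum' names a DataFrame method, so pandas can resolve the string.
by_name = df.apply('sum')
by_method = df.sum()

# 'b_score_function' is neither a DataFrame method nor a NumPy function,
# so the string lookup fails even if a Python function with that name exists.
message = None
try:
    df.apply('b_score_function', axis=1)
except AttributeError as exc:
    message = str(exc)
```

Here by_name and by_method hold the same column sums, while message contains the "is not a valid function" error.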

Tom
  • Thank you for your fast response. Yes, I have more than a dozen columns. I will think of a way to improve mine by creating a dictionary – Deysiz Jul 23 '22 at 04:29
  • @Deysiz you could try using `locals()` or `globals()`, see my edit – Tom Jul 23 '22 at 13:04
0

For a vectorized approach that handles all columns at once, you can use dictionaries to hold the threshold and replacement values, then numpy.select:

import numpy as np
import pandas as pd

# example input
df = pd.DataFrame({'b': [-1, 2, 5],
                   'c': [5, -1, 1]})

# dictionaries (one key:value per column)
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}

out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')

output:

   b_score  c_score
0     0.00      1.0
1     0.25      0.0
2     1.00      0.5
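To reuse this on a dataframe that also carries the key column, the same idea could be wrapped in a helper. This is a sketch, not part of the answer above: score_columns is a hypothetical name, and it converts everything to plain NumPy arrays before calling numpy.select so that the broadcasting is explicit.

```python
import numpy as np
import pandas as pd

def score_columns(df, thresh, repl):
    """Score the columns named in `thresh` (hypothetical helper).

    Per column: 0 if value <= 0, repl[col] if value <= thresh[col], else 1.
    """
    cols = list(thresh)
    values = df[cols].to_numpy()
    upper = np.array([thresh[c] for c in cols])  # one threshold per column
    mid = np.array([repl[c] for c in cols])      # one middle score per column
    # Conditions are checked in order; the first match wins.
    scores = np.select([values <= 0, values <= upper], [0.0, mid], default=1.0)
    return pd.DataFrame(scores,
                        columns=[f'{c}_score' for c in cols],
                        index=df.index)

df = pd.DataFrame({'a_id': [10, 11, 12],
                   'b': [-1, 2, 5],
                   'c': [5, -1, 1]})

scores = score_columns(df, thresh={'b': 2, 'c': 1}, repl={'b': 0.25, 'c': 0.5})
result = df.join(scores)  # keeps a_id next to the new score columns
```

New columns only need an entry in each dictionary, and a_id is left untouched because it is not listed in thresh.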
mozway