15

I have a pandas lookup table which looks like this

Grade   Lower_Boundary  Upper_Boundary
1   -110    -96
2   -96 -91
3   -91 -85
4   -85 -81
5   -81 -77
6   -77 -72
7   -72 -68
8   -68 -63
9   -63 -58
10  -58 -54
11  -54 -50
12  -50 -46
13  -46 -42
14  -42 -38
15  -38 -34
16  -34 -28
17  -28 -18
18  -18 -11
19  -11 -11
20  -11 -9

I have another pandas dataframe that looks contains score. I want to assign 'Grade' to the score column, by looking up the look up table. So based on which interval of lower and upper boundary the score falls, the grade should be assigned from that row in the lookup table. Is there a way to do it without typing a bunch of if then else statements? I am thinking just of excel's index match.

Score   Grade
-75 6
-75 6
-60 9
-66 8
-66 8
-98 1
-60 9
-82 4
-70 7
-60 9
-60 9
-60 9
-56 10
-70 7
-70 7
-70 7
-66 8
-56 10
-66 8
-66 8
Zenvega
  • 1,974
  • 9
  • 28
  • 45
  • Iterate over the table, check whether the score is within the threshold, if so, you know your grade, otherwise continue – Sebastian Hoffmann Feb 17 '16 at 22:56
  • If you can, please you share the code? – Zenvega Feb 17 '16 at 22:57
  • I'm not familiar with pandas, but it would utterly suprise me, if it doesn't provide a way to iterate over its data structures. – Sebastian Hoffmann Feb 17 '16 at 22:59
  • 1
    I know how to iterate over the score table but using if then else conditions for the grade lookup. I was wondering if there was a more efficient way of coding that would reduce the number of lines of code. If you are familiar with excel index match, that is what I am looking for.. – Zenvega Feb 17 '16 at 23:02
  • what columns contains your second DF: Score and Grade? – MaxU - stand with Ukraine Feb 17 '16 at 23:08
  • No just the scores, I want to add Grade column to it, by lookup up the lookup table. – Zenvega Feb 17 '16 at 23:10
  • check this: http://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others – MaxU - stand with Ukraine Feb 17 '16 at 23:22
  • I checked out that link. It is really cumbersome because it creates an intermediate dataframe which is a cartesian product of the two tables, the intermediate table can get really huge, but the final table will be of the right size. Here is another link I found that would work better, but still cumbersome. It adds a separate line of code for each condition.http://stackoverflow.com/questions/21733893/pandas-dataframe-add-a-field-based-on-multiple-if-statements – Zenvega Feb 17 '16 at 23:45
  • Did my solution work? – IanS Feb 24 '16 at 04:55
  • Hi IanS. Yes your solution did work. Thank you! – Zenvega Feb 25 '16 at 16:49

2 Answers2

13

A one-line solution (I call your lookup table lookup):

df['Score'].apply(lambda score: lookup['Grade'][(lookup['Lower_Boundary'] <= score) & (lookup['Upper_Boundary'] > score)].values[0])

Explanation:

For a given score, here is how to find the grade:

score = -75
match = (lookup['Lower_Boundary'] <= score) & (lookup['Upper_Boundary'] > score)
grade = lookup['Grade'][match]

This return a series of length 1. You can get its value with, for instance:

grade.values[0]

All you need to do is apply the above to the score column. If you want a one-liner, use a lambda function:

df['Score'].apply(lambda score: lookup['Grade'][(lookup['Lower_Boundary'] <= score) & (lookup['Upper_Boundary'] > score)].values[0])

Otherwise the following would be more readable:

def lookup_grade(score):
    match = (lookup['Lower_Boundary'] <= score) & (lookup['Upper_Boundary'] > score)
    grade = lookup['Grade'][match]
    return grade.values[0]

df['Score'].apply(lookup_grade)

This approach would also make it easier to deal with cases when no match is found.

IanS
  • 15,771
  • 9
  • 60
  • 84
  • I'm trying to use this to get a process to work using 2 input fields. I don't understand how score is being passed to lookup_grade() with df['Score'].apply(lookup_grade)? IIsn't the apply being applied to the 'Score' column? – MrKingsley Jan 23 '23 at 17:46
  • Yes, `apply` will apply the function `lookup_grade` to every value in the column successively. For two input fields (columns) it would look something like: `df.apply(lookup_function, axis=1)` where lookup_function takes a row (a pandas series) as input, so you would access the values as `row['field1']` and `row['field2']`. – IanS Jan 25 '23 at 09:45
  • Ahhh, yes! I managed to get this working df1['result'] = df1.apply(lambda x: lookupdf(x['val1'], x['val2']), axis=1) lookup being a defined function asking for 2 inputs, in this case columns of df1, or the values in said columns. Having trouble getting it into a one liner though; would using a one liner make it any faster vs using the defined function? – MrKingsley Jan 25 '23 at 13:56
  • No, a one-liner only makes it easier to read (hopefully). Defining a function is perfectly fine. – IanS Jan 26 '23 at 16:07
  • 1
    Ok, then I will leave it as is; thanks for the help. – MrKingsley Jan 27 '23 at 14:15
0

One option is with the conditional_join from pyjanitor, and also covers overlapping intervals:

# pip install pyjanitor
import pandas as pd
import janitor

(df2
.conditional_join(
    df1, 
    ('Score', 'Lower_Boundary', '>='), 
    ('Score', 'Upper_Boundary', '<='), 
    how = 'left')
.loc[:, ['Score', 'Grade']]
)

    Score  Grade
0     -75      6
1     -75      6
2     -60      9
3     -66      8
4     -66      8
5     -98      1
6     -60      9
7     -82      4
8     -70      7
9     -60      9
10    -60      9
11    -60      9
12    -56     10
13    -70      7
14    -70      7
15    -70      7
16    -66      8
17    -56     10
18    -66      8
19    -66      8

You could achieve this also with pd.IntervalIndex and it should be faster, especially if your data does not have overlapping intervals.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31