1

I have the following DataFrame:

Variable    Value  Classification
Variable_1  18  
Variable_1  25  
Variable_1  16
Variable_1  34
Variable_2  37  
Variable_2  22  
Variable_2  14  
Variable_2  26  

I'd like to assign classification values to the empty column in the table above by making a comparison with intervals/ranges defined in the table below.

Variable    Classif from    to          
Variable_1      A   17      24
Variable_1      B   25      30
Variable_1      C   31      35
Variable_2      A   10      19
Variable_2      B   20      25
Variable_2      C   26      50

The first table is just an example of the actual Dataframe (the original has more than 20k rows).

Can someone recommend an efficient way to do this? thanks in advance

Pablo
  • 167
  • 2
  • 9
  • 1. Please share your data [in a reproducible way](https://stackoverflow.com/questions/20109391). 2. Your limits are ambiguous: Does 24 for Variable 1 belong to class A or B? How to classify 20 for Variable 2? Please explain the definitions and properties of your data in detail. – Bill Huang Oct 19 '20 at 22:19
  • no images! Can you kindly look at: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples Some of the answer there will help show you how to get as text into your StackOverFlow question. – David Erickson Oct 19 '20 at 22:19
  • 1
    What will be assigned for `Variable 2` value `37`? – Michael Szczesny Oct 19 '20 at 22:20
  • Hi @BillHuang, thanks for the prompt reply, I edited the original post. – Pablo Oct 19 '20 at 22:57
  • 1
    There are many out-of-bound values ( `< from` or `> to`). How to deal with them? Also, the 4th row, 81 to 19? – Bill Huang Oct 19 '20 at 23:06

1 Answers1

1

As mentioned, there are some issues with the conditions, as only two of the values met the conditions. I add a Condition Met? column for you to visualize this, and from there you can drop that column or only keep True rows.

In the data below df is your first dataframe in your question and df2 is the second:

df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
# df2 = df2[df2['Condition Met?']].drop('Condition Met?', axis=1)
df2
Out[1]: 
      Variable  Value Classif  from  to  Condition Met?
0   Variable_1     18       A    17  24            True
11  Variable_1     37       C    31  35           False
8   Variable_1     54       C    31  35           False
5   Variable_1     65       C    31  35           False
16  Variable_2     22       B    20  25            True
14  Variable_2     37       C    26  50            True
23  Variable_2     66       C    26  50           False
20  Variable_2     78       C    26  50           False

After dropping Condtion Met? rows that are False as well as the column itself:

df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
df2 = df2[df2['Condition Met?']].drop('Condition Met?', axis=1)
df2
Out[2]: 
      Variable  Value Classif  from  to
0   Variable_1     18       A    17  24
16  Variable_2     22       B    20  25
14  Variable_2     37       C    26  50

Alternatively, you can return NaN in the Classif column if condition is not met.

df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
df2['Classif'] = df2['Classif'].where(df2['Condition Met?'],np.nan)
df2 = df2.drop('Condition Met?', axis=1)
df2
Out[3]: 
      Variable  Value Classif  from  to
0   Variable_1     18       A    17  24
11  Variable_1     37     NaN    31  35
8   Variable_1     54     NaN    31  35
5   Variable_1     65     NaN    31  35
16  Variable_2     22       B    20  25
14  Variable_2     37       C    26  50
23  Variable_2     66     NaN    26  50
20  Variable_2     78     NaN    26  50
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • thanks for the quick response @DavidErickson. I fixed the tables in my post. In my previous edit I missed what you have spotted. – Pablo Oct 19 '20 at 23:23