1

I am creating a new column in a dataframe based on a conditional function. In the column I am mapping from, there are multiple NaN values. If a NaN value is present in the original column, I also want this to be present in my new column. As an example, my starting point is this:

Original
0   1
1   2
2   3
3   4
4   5
5   6
6   Nan
7   8
8   9
9   10

Here is a sample of the code I initially run which gives the following result (clearly):

def get_value(range):
    if range < 2:
        return 'Below 2'
    elif range < 8:
        return 'Between 2 and 8'
    else:
        return 'Above 8'

df_sample['new_col'] = df_sample.apply(lambda x: get_value(x['Original']), axis=1)

Original    new_col
0   1.0 Below 2
1   2.0 Between 2 and 8
2   3.0 Between 2 and 8
3   4.0 Between 2 and 8
4   5.0 Between 2 and 8
5   6.0 Between 2 and 8
6   NaN Above 8
7   8.0 Above 8
8   9.0 Above 8
9   10.0    Above 8

Here, index 6 should show NaN.

I have tried including elif range == np.Nan: in my function but this didn't work.

I then tried the following based on a recommendation on Stackoverflow:

df_sample['new_col'] = df_sample.apply(lambda x: get_value(x) if(np.all(pd.notnull(x['Original']))) else x, axis = 1)

But this returns an error at the first NaN index in my dataframe.

ojp
  • 973
  • 1
  • 11
  • 26
  • This is not a recommended way of doing this, but if you insist then you could add `if np.isnan(range): return np.nan` condition – Chris Adams Mar 03 '20 at 16:55
  • @ChrisA do you mind explaining why this is not recommended? Thanks a lot. – ojp Mar 03 '20 at 21:06

3 Answers3

2

Déjà vu here, but following on from my last solution, just add default for where no condition is met:

import numpy as np 

condlist = [
    df['Original'].lt(2),
    df['Original'].lt(8),
    df['Original'].ge(8)]

choicelist = ['Below 2', 'Between 2 and 8', 'Above 8']

df['new_col'] = np.select(condlist, choicelist, default=np.nan)
print(df)

[out]

   Original          new_col
0       1.0          Below 2
1       2.0  Between 2 and 8
2       3.0  Between 2 and 8
3       4.0  Between 2 and 8
4       5.0  Between 2 and 8
5       6.0  Between 2 and 8
6       NaN              nan
7       8.0          Above 8
8       9.0          Above 8
9      10.0          Above 8
Chris Adams
  • 18,389
  • 4
  • 22
  • 39
  • Thanks Chris, sorry I got stuck with this and thought it would be best to create a new question – ojp Mar 03 '20 at 17:10
1

In general, do not use apply. In thise case, cut is a much better choice:

pd.cut(df.Original, [-np.inf, 2, 8, np.inf],
       labels = ['below 2', 'between 2 and 8', 'above 8'],
       right=False)

Output:

0            below 2
1    between 2 and 8
2    between 2 and 8
3    between 2 and 8
4    between 2 and 8
5    between 2 and 8
6                NaN
7            above 8
8            above 8
9            above 8
Name: Original, dtype: category
Categories (3, object): [below 2 < between 2 and 8 < above 8]
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

To comment on your code :

As you're using the else statement, everything not bellow 8 will be display as 'above 8'. Even if in your original dataset you have the string 'hello world'.

To stay as simple as your code you could do :

def get_value(range):
    if range < 2:
        return 'Below 2'
    elif range < 8:
        return 'Between 2 and 8'
    elif range >= 8:
        return 'Above 8'
    else:
        return np.nan
dantarno
  • 92
  • 1
  • 6