python pandas Ignore Nan in integer comparisons

Question

I am trying to create dummy variables based on integer comparisons on Series where NaN is common. A > comparison raises errors if there are any NaN values, but I want the comparison to return NaN. I understand that I could use fillna() to replace NaN with a value that I know will compare as False, but I hope there is a more elegant way to do this. I would need to change the value passed to fillna() if I used less-than instead, or if the variable could be positive or negative, and that is one more opportunity to introduce errors. Is there any way to make 30 < NaN evaluate to NaN?

To be clear, I want this:

df['var_dummy'] = df[df['var'] >= 30].astype('int')

to return a null if var is null, 1 if it is 30+, and 0 otherwise. Currently I get ValueError: cannot reindex from a duplicate axis.

Why would `30 < NaN` give `NaN`? Aren't you after a boolean comparison, which in this case should be `False`? — EdChum, Sep 16 '16 at 15:20
Also, you cannot have NaNs in a boolean array. So even if you filter and do what you want to do, you will have an array of floats. — ayhan, Sep 16 '16 at 15:23
I'm sorry, I wasn't clear. I want a statement that evaluates like "if x > 30 then y = 1". I'm getting an error as soon as it evaluates NaN > 30 — Drevent, Sep 16 '16 at 18:08

score 5 · Answer 1 · answered Sep 16 '16 at 15:28

Here's a way:

import numpy as np
import pandas as pd

s1 = pd.Series([1, 3, 4, 2, np.nan, 5, np.nan, 7])
s2 = pd.Series([2, 1, 5, 5, np.nan, np.nan, 2, np.nan])

(s1 < s2).mask(s1.isnull() | s2.isnull(), np.nan)
Out: 
0    1.0
1    0.0
2    1.0
3    1.0
4    NaN
5    NaN
6    NaN
7    NaN
dtype: float64

This masks the boolean Series returned by (s1 < s2) wherever either operand is NaN, so those positions become NaN. Because a boolean array cannot hold NaN, the result is no longer boolean; in the output above it is cast to float64.
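As a minimal sketch, the same masking idea can be applied to the scalar threshold from the question. The sample values below are made up for illustration, and the explicit cast to float is an addition (not part of the answer above) so that the resulting dtype does not depend on how a given pandas version handles NaN in a boolean Series.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's df['var'].
df = pd.DataFrame({"var": [12, 30, np.nan, 45, -4]})

# Compare first (NaN >= 30 evaluates to False), cast to float so the
# column can hold NaN, then set the rows with a missing input to NaN.
df["var_dummy"] = (df["var"] >= 30).astype(float).mask(df["var"].isna())

print(df)
```

Rows with a missing var end up as NaN, and all other rows hold 1.0 or 0.0.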

score 0 · Answer 2 · answered Jan 17 '20 at 09:47

Solution 1

df['var_dummy'] = 1 * df.loc[~pd.isnull(df['var']), 'var'].ge(30)

Solution 2

df['var_dummy'] = df['var'].apply(lambda x: np.nan if x != x else 1 * (x >= 30))

x != x is True only when x is NaN, so it is equivalent to math.isnan(x).
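The self-inequality trick can be checked with the standard library alone. The sketch below uses a hypothetical helper named dummy (not part of the answer) and the question's >= 30 threshold.

```python
import math

nan = float("nan")


def dummy(x, threshold=30):
    """Hypothetical helper: NaN in, NaN out; otherwise 1 or 0."""
    if x != x:  # NaN is the only float that is not equal to itself
        return nan
    return 1 * (x >= threshold)


# x != x agrees with math.isnan for NaN and for ordinary floats.
for value in (nan, 0.0, 29.5, 30.0):
    assert (value != value) == math.isnan(value)

results = [dummy(v) for v in (12, 30, nan, 45)]
print(results)
```

Passing a function like this to Series.apply calls it once per element, so it is usually slower than a vectorized comparison on large data.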

mbh86

MECoskun · Answer 3 · 2020-10-06T20:59:26.763

You can use the notna() method. Here is an example:

import pandas as pd

list1 = [12, 34, -4, None, 45]
list2 = ['a', 'b', 'c', 'd', 'e']
  
# Build a DataFrame from the two lists above
df = pd.DataFrame(list(zip(list1, list2)), columns=['var1', 'letter'])

# Assign the new dummy variable; rows where var1 is NaN are left out of the
# comparison and come back as NaN when the result is aligned on the index
df['var_dummy'] = df['var1'][df['var1'].notna()] >= 30
# or equivalently: df['var_dummy'] = df.var1[df.var1.notna()] >= 30

df

Will produce the below output:

    var1    letter  var_dummy
0   12.0    a       False
1   34.0    b       True
2   -4.0    c       False
3   NaN     d       NaN
4   45.0    e       True

So the new dummy variable is NaN in the rows where the original variable is NaN. The only thing that does not match your request is that the dummy variable holds False and True instead of 0 and 1, but you can convert these, for example by casting the comparison result to float before assigning it.
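One possible way to get 0 and 1 instead of False and True is to cast the comparison result to float before assigning it. This is a sketch built on the same example data as above; the intermediate name valid is introduced here only for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": [12, 34, -4, None, 45],
    "letter": ["a", "b", "c", "d", "e"],
})

# Keep only the rows where var1 is present, compare, and cast to float.
valid = df["var1"][df["var1"].notna()]
df["var_dummy"] = (valid >= 30).astype(float)

print(df)
```

The assignment aligns on the index, so the row that was filtered out (index 3) receives NaN while the rest hold 1.0 or 0.0.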

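One thing you cannot change with the default NumPy-backed dtypes, however, is that a numeric dummy variable containing NaN has to be of float type, because NaN is itself a special float value. (pandas' nullable extension dtypes, such as "Int64" and "boolean", represent missing values with pd.NA instead and are the exception.) More information about NaN as a float is given here: How can I check for NaN values?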

and here: https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
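As a supplement, and not something the answers above use, pandas also ships nullable extension dtypes that keep missing values through a comparison. The sketch below assumes pandas 1.0 or later and reuses the example values from this answer.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer dtype; None becomes <NA>.
var = pd.Series([12, 34, -4, None, 45], dtype="Int64")

# Comparing a nullable integer Series yields the nullable "boolean" dtype,
# and the missing entry stays missing instead of becoming False.
flag = var >= 30

# Cast back to nullable integer to get 1 / 0 / <NA>.
dummy = flag.astype("Int64")

print(dummy)
```

Here the missing value propagates on its own, so no mask or filter step is needed, and the dummy stays an integer column.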
