
I am trying to create dummy variables based on integer comparisons in Series where NaN is common. A > comparison raises errors if there are any NaN values, but I want the comparison to return NaN. I understand that I could use fillna() to replace NaN with a value that I know will compare as false, but I would hope there is a more elegant way to do this. I would need to change the value in fillna() if I used a less-than comparison, or a variable that could be positive or negative, and that is one more opportunity to create errors. Is there any way to make 30 < NaN evaluate to NaN?

To be clear, I want this:

df['var_dummy'] = df[df['var'] >= 30].astype('int')

to return a null if var is null, 1 if it is 30+, and 0 otherwise. Currently I get ValueError: cannot reindex from a duplicate axis.

Drevent
  • Why would `30 < NaN` give `NaN`? Aren't you after a boolean comparison, which in this case should be `False`? – EdChum Sep 16 '16 at 15:20
  • Also, you cannot have NaNs in a boolean array. So even if you filter and do what you want to do, you will have an array of floats. – ayhan Sep 16 '16 at 15:23
  • I'm sorry, I wasn't clear. I want a statement that evaluates like "if x > 30 then y = 1". I'm getting an error as soon as it evaluates NaN > 30 – Drevent Sep 16 '16 at 18:08
  • I added some code and error for clarification – Drevent Sep 16 '16 at 18:30

3 Answers


Here's a way:

s1 = pd.Series([1, 3, 4, 2, np.nan, 5, np.nan, 7])
s2 = pd.Series([2, 1, 5, 5, np.nan, np.nan, 2, np.nan])

(s1 < s2).mask(s1.isnull() | s2.isnull(), np.nan)
Out: 
0    1.0
1    0.0
2    1.0
3    1.0
4    NaN
5    NaN
6    NaN
7    NaN
dtype: float64

This masks the boolean Series returned by (s1 < s2) wherever either operand is NaN, so those positions become NaN. Because a NumPy boolean array cannot hold NaN, the result is cast to float.
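The same idea carries over to the single-column case from the question. Here is a minimal sketch, assuming a column compared against the scalar 30; the sample values and the names `var` and `var_dummy` are made up for illustration, and `where()` is used as the mirror image of `mask()`:

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real column would come from the DataFrame.
var = pd.Series([12, 34, np.nan, 30, -4], name="var")

# Compare first (NaN compares as False), convert to 0.0/1.0,
# then keep only the rows where the input was not missing.
var_dummy = (var >= 30).astype(float).where(var.notna())

print(var_dummy)
```

The result is a float Series holding 1.0 for values of 30 or more, 0.0 for smaller values, and NaN where the input was NaN.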

ayhan

Solution 1

df['var_dummy'] = 1 * df.loc[~pd.isnull(df['var']), 'var'].ge(30)

Solution 2

df['var_dummy'] = df['var'].apply(lambda x: np.nan if x != x else 1 * (x >= 30))

x != x is true only when x is NaN, so the check is equivalent to math.isnan(x).
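A quick way to convince yourself of the x != x trick, using only the standard library. The helper name `is_nan` is hypothetical and exists only for this sketch:

```python
import math

nan = float("nan")

# Ordinary comparisons with NaN do not raise; they all evaluate to False.
comparisons = (30 < nan, 30 > nan, nan == nan)

# NaN is the only float value that is not equal to itself.
self_check = nan != nan


def is_nan(x):
    """Hypothetical helper: the x != x trick wrapped in a function."""
    return x != x


print(comparisons)  # (False, False, False)
print(self_check)   # True
print(is_nan(nan) == math.isnan(nan), is_nan(30.0) == math.isnan(30.0))  # True True
```

This also shows that a plain scalar comparison against NaN returns False rather than raising, which is why the lambda has to test for NaN before comparing.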

mbh86

You can use the notna() method. Here is an example:

import pandas as pd

list1 = [12, 34, -4, None, 45]
list2 = ['a', 'b', 'c', 'd', 'e']
  
# Calling DataFrame constructor on above lists
df = pd.DataFrame(list(zip(list1, list2)), columns =['var1','letter'])

#Assigning new dummy variable:
df['var_dummy'] = df['var1'][df['var1'].notna()] >= 30
# or you can also use: df['var_dummy'] = df.var1[df.var1.notna()] >= 30

df

Will produce the below output:

    var1    letter  var_dummy
0   12.0    a       False
1   34.0    b       True
2   -4.0    c       False
3   NaN     d       NaN
4   45.0    e       True

So the new dummy variable has NaN value for the original variable's NaN rows. The only thing that does not match your request is that the dummy variable takes False and True values instead of 0 and 1, but you can easily reassign the values.
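One way to do that reassignment, as a sketch under the same setup as the example above: convert the filtered boolean result to float before assigning it, so index alignment fills the missing row with NaN. The intermediate name `flags` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"var1": [12, 34, -4, None, 45]})

# Boolean result for the non-null rows only (index 0, 1, 2, 4).
flags = df["var1"][df["var1"].notna()] >= 30

# Cast True/False to 1.0/0.0; assignment aligns on the index,
# so the row that was filtered out (index 3) ends up as NaN.
df["var_dummy"] = flags.astype(float)

print(df)
```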

One thing you cannot change, at least with the default NumPy-backed dtypes, is that a 0/1 dummy variable containing NaN has to be of float type, because NaN is itself a special float value. More information about NaN as a float is available here: How can I check for NaN values?

and here: https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
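As a side note on the float restriction: it applies to NumPy-backed columns. If your pandas version provides the nullable extension dtypes (an assumption here: pandas 1.0 or later), a sketch like the following keeps the dummy as integers with a missing marker. The sample values are made up for illustration.

```python
import pandas as pd

# Nullable integer column; None is stored as <NA>, not as a float NaN.
var = pd.Series([12, 34, None, 45], dtype="Int64")

# Comparisons on nullable dtypes propagate missing values,
# giving a nullable "boolean" Series: False, True, <NA>, True.
flags = var >= 30

# Convert to nullable integers: 0, 1, <NA>, 1.
var_dummy = flags.astype("Int64")

print(var_dummy)
```

Note that the missing marker here is `pd.NA` rather than `np.nan`, so downstream code that expects float NaN may behave differently.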

MECoskun