How to create dummy variables in Pandas (Python 2.7) has been asked many times, but I don't know of a robust and fast solution yet. Consider this dataframe:

import numpy as np
import pandas as pd

df=pd.DataFrame({'A':[1,2,-1,np.nan, 'rh']})
df
Out[9]: 
     A
0    1
1    2
2   -1
3  NaN
4   rh

Yes, it has mixed types. That happens all the time with big datasets (I have millions of rows).

I need to create dummy variables that are 1 if a condition is true and 0 otherwise. I am assuming that if Pandas cannot perform the logical comparison (say, comparing whether a string is larger than some real number), I would get a zero. Look at what happens instead:

df['dummy2']=(df.A > 0).astype(int)

df['dummy1']=np.where(df.A>0,1,0)

df
Out[12]: 
     A  dummy2  dummy1
0    1       1       1
1    2       1       1
2   -1       0       0
3  NaN       0       0
4   rh       1       1

Clearly this is problematic. What is happening here? How can I prevent these false flags?

Many thanks!

ℕʘʘḆḽḘ
2 Answers

Two ways you could do this:

In [37]: pd.to_numeric(df.A, errors='coerce').notnull() & (df.A > 0)
Out[37]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

In [38]: df.A.apply(np.isreal) & (df.A > 0)
Out[38]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool

A third way, which could perhaps be slow:

In [39]: df.A.str.isnumeric().isnull() & (df.A > 0)
Out[39]:
0     True
1     True
2    False
3    False
4    False
Name: A, dtype: bool
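
To get the actual 0/1 dummy column the question asks for, any of these boolean masks can simply be cast to int. A minimal sketch along the lines of the first approach (the column name `dummy3` is just for illustration, and comparing on the coerced series keeps it working on Python 3 as well):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, -1, np.nan, 'rh']})

    # Coerce non-numeric entries to NaN once, then build the mask on the
    # coerced values; NaN compares False against everything.
    coerced = pd.to_numeric(df.A, errors='coerce')
    mask = coerced.notnull() & (coerced > 0)

    # Cast the boolean mask to 0/1 for the dummy column.
    df['dummy3'] = mask.astype(int)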
Zero

Update: @JohnGalt pointed out in the comments that a better way would be to use pd.to_numeric with errors='coerce':

# Your condition here, instead of `> 0`, using the fact that NaN > 0 == False
In [18]: df['dummy1'] = (pd.to_numeric(df.A, errors='coerce') > 0).astype(int)
In [19]: df
Out[19]:
     A  dummy1
0    1       1
1    2       1
2   -1       0
3  NaN       0
4   rh       0
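
The `> 0` here is only the condition from the question; once the column has been coerced, any vectorized condition can be dropped in its place. A small sketch with a made-up cutoff (the column name `dummy_ge2` and the threshold are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, -1, np.nan, 'rh']})

    # Coerce once; rows that fail conversion become NaN, and any comparison
    # against NaN evaluates to False, so they end up as 0.
    numeric_A = pd.to_numeric(df.A, errors='coerce')

    # Hypothetical condition: flag values greater than or equal to 2.
    df['dummy_ge2'] = (numeric_A >= 2).astype(int)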

One general way to create such dummy variables would be along these lines:

def foo(a):
    try:
        tmp = int(a)
        return 1 if tmp > 0 else 0  # Your condition here.
    except (ValueError, TypeError):
        # Anything that cannot be converted to a number gets a 0.
        return 0

In [12]: df.A.map(foo)
Out[12]:
0    1
1    1
2    0
3    0
4    0
Name: A, dtype: int64
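
If you need several dummies with different conditions, the same try/except idea can be wrapped in a small helper. This is only a sketch; the name `make_dummy` (and the use of `float` instead of `int`) is my own choice, not part of the answer above:

    def make_dummy(condition):
        """Return a function for Series.map that yields 1 when the value
        converts to a number and satisfies `condition`, else 0."""
        def _flag(a):
            try:
                return 1 if condition(float(a)) else 0
            except (ValueError, TypeError):
                return 0
        return _flag

    # Same result as foo above for the `> 0` condition.
    df['dummy1'] = df.A.map(make_dummy(lambda x: x > 0))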

You are doing the operations in Python 2.7, where comparisons between str and int are (unfortunately) allowed. The operations fail on Python 3:

In [5]: df.A > 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-890e73655a37> in <module>()
----> 1 df.A > 0

/home/utkarshu/miniconda3/envs/py35/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    724                 other = np.asarray(other)
    725
--> 726             res = na_op(values, other)
    727             if isscalar(res):
    728                 raise TypeError('Could not compare %s type with Series'

/home/utkarshu/miniconda3/envs/py35/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
    646                     result = lib.vec_compare(x, y, op)
    647             else:
--> 648                 result = lib.scalar_compare(x, y, op)
    649         else:
    650

pandas/lib.pyx in pandas.lib.scalar_compare (pandas/lib.c:14186)()

TypeError: unorderable types: str() > int()
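
For completeness, the false flags in the question come from CPython 2 ordering mismatched types instead of raising: numbers sort before every other type, so any string compares as greater than any number. A quick sketch that only runs under Python 2.7:

    # Python 2.7 only: mixed-type comparisons do not raise; CPython 2 orders
    # numbers before all other types, so any str is "greater than" any number.
    print('rh' > 0)          # True  -> the spurious 1 for the 'rh' row
    print('rh' > 10 ** 9)    # True  -> still True, whatever the number is
    print(float('nan') > 0)  # False -> NaN rows correctly become 0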
musically_ut