17

I have a pandas.DataFrame object that contains about 100 columns and 200000 rows of data. I am trying to convert it to a bool dataframe where True means that the value is greater than the threshold, False means that it is less, and NaN values are maintained.

If there are no NaN values, it takes about 60 ms for me to run:

df >= threshold

But when I try to deal with the NaNs, the below method works, but is very slow (20 sec).

def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x
df.apply(lambda x: x.apply(lambda x: func(x)))

Is there a faster way?

jsignell
  • 3,072
  • 1
  • 22
  • 23
  • Try to replace your `func` with this line: `return x >= threshold if x is not None else x`, it might be faster. BTW why did you assign two `lambda x`? `df.apply(func)` will do the trick. – DeepSpace Nov 18 '15 at 14:58
  • @DeepSpace that took the same time – jsignell Nov 18 '15 at 15:05

5 Answers5

11

You can do:

new_df = df >= threshold
new_df[df.isnull()] = np.NaN

But that is different from what you will get using the apply method. Here your mask has float dtype containing NaN, 0.0 and 1.0. In the apply solution you get object dtype with NaN, False, and True.

Neither are OK to be used as a mask because you might not get what you want. IEEE says that any NaN comparison must yield False and the apply method is implicitly violates that by returning NaN!

The best option is to keep track of the NaNs separately and df.isnull() is quite fast when bottleneck is installed.

ocefpaf
  • 569
  • 4
  • 15
  • "the apply method is implicitly violates that by returning NaN" - ouch! Any more recent attempts to address that problem? – jtlz2 Mar 14 '23 at 13:00
3

You can check for NaNs separately using this post: Python - find integer index of rows with NaN in pandas

df.isnull()

Combine the output of isnull with df >= threshold using bitwise or:

df.isnull() | df >= threshold

You can expect the two masks to take closer to 200ms to compute and combine, but that should be far enough away from 20s to be OK.

Community
  • 1
  • 1
Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • Do you have ideas about how to combine them? That is the path I think I need to go down too. – jsignell Nov 18 '15 at 15:09
  • 1
    This did not work for me. I tried it in python 2.7.1, pandas 0.17.0 (what I usually use) and got a NotImplementedError, then I tried it in python 3.4.3, pandas 0.17.0 and got: 'bitwise_or' not supported for the input type – jsignell Nov 19 '15 at 13:23
  • Try using `np.logical_or(df.isnull(), df >= threshold)` instead. Here is a notebook with my timings: http://nbviewer.ipython.org/gist/ocefpaf/4539348e5ed71f7fe94f – ocefpaf Nov 19 '15 at 13:55
  • OK I missed the `NaN values are maintained` part. This is not pretty and it is still slow (but faster than apply): `df = df_nans >= threshold` `df[df_nans.isnull()] = np.NaN` – ocefpaf Nov 19 '15 at 14:03
2

Another option is to use mask:

df.mask(~df.isna(), df >= threshold)

This will only apply the condition to non-nan values and leave the nan values untouched

Bruno Carballo
  • 1,156
  • 8
  • 15
1

In this situation I use an indicator array of floats, coded as: 0=False, 1=True, and NaN=missing. A Pandas DataFrame with bool dtype cannot have missing values, and a DataFrame with object dtype holding a mix of Python bool and float objects is not efficient. This leads us to using DataFrames with np.float64 dtype. numpy.sign(x - threshold) gives -1 = (x < threshold), 0 = (x == threshold) and +1 = (x > threshold) for your comparison, which might be good enough for your purposes, but if you really need 0/1 coding, the conversion can be made in-place. Timings below are on a 200K length array x:

In [45]: %timeit y = (x > 0); y[pd.isnull(x)] = np.nan
100 loops, best of 3: 8.71 ms per loop

In [46]: %timeit y = np.sign(x)
100 loops, best of 3: 1.82 ms per loop

In [47]: %timeit y = np.sign(x); y += 1; y /= 2
100 loops, best of 3: 3.78 ms per loop
Kerby Shedden
  • 614
  • 4
  • 9
  • 1
    I should have mentioned that all three approaches above give you a DataFrame y with dtype `np.float64`, and all preserve NaN's. The second approach gives -1/1 coding for False/True and the others give 0/1 coding. `y = (1 + np.sign(x)) / 2` is also competitive. – Kerby Shedden Nov 28 '15 at 03:59
  • 1
    This may not give you what you want when there is exact equality. `np.sign(x - threshold)` will be 0 if `x == threshold` so in the final result you will get 0 if x < threshold 1/2 if x==threshold and 1 if x > threshold. If equality is a possibility, you could go with `y = (1 + np.sign(eps + x - threshold)) / 2`, where `eps = np.finfo(np.float64).eps`. – Kerby Shedden Nov 28 '15 at 04:13
0

This will keep null values, and transform the rest to boolean depending on whether they are greater than the specified threshold:

from pandas import isnull

df["my_score"].apply(lambda score: score if isnull(score) else score >= threshold)
s2t2
  • 2,462
  • 5
  • 37
  • 47