I have a dataframe that looks like this:

idx    a      b      c      d      e      f      g      h      i       j
1      0     17     17     83     17      0     21     16     21       4
2     -9     31     31     74     40      0     39     39     39       9
3    -27      0    -27     92     27    -37      3    -37     40      16
4     -4      0     -4     81      4     -1      5      5      6       9

I'd like to apply:

where x>0: functionA(x)

where x<0: functionB(x)

What I've tried independently:

df[df>0] = np.log(df)

and

df[df<0] = -np.log(-df)

Each seems to work on its own, but running the two operations sequentially will not work: after the first operation the dataframe converts from int to float, and original values become indistinguishable from log values, e.g. is a 0 an original 0 or log(1) = 0?

I'm also concerned about these errors:

Divide by zero

usr/local/anaconda3/envs/ds/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
  """Entry point for launching an IPython kernel.```

Invalid value

/usr/local/anaconda3/envs/ds/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in log
      """Entry point for launching an IPython kernel.

Which shouldn't occur, because there are no NaN values and I'm explicitly selecting non-zero values.

df.isnull().values.any()
False

The final issue is how to do this efficiently as I'm working with billions of rows.

encore2097
    What is the problem with the numbers being converted to float? – Antimony Sep 09 '17 at 20:50
  • Numbers might be close to zero, which messes with floating-point math. It's better to compute abs(x-0) < eps, where eps is small: around 10^-5. Because of this, I think it's best to compute the two operations on separate int dataframes and then merge them. – encore2097 Sep 09 '17 at 21:59
  • I don't get the conversion to float part. I understand that log can compute really small values, but since you're complaining about getting `float` then what's the purpose of doing `log`? You could always convert the whole dataframe to `float32` while loading/creating – Colonder Sep 09 '17 at 23:08
  • There's no complaint about going to float. The issue is about applying the two operations on the same dataframe. After one operation the df is mutated into a float32 with log values. Then it's no longer possible to differentiate original values from log values, i.e. is a value 0 or log(1) = 0? I've removed that part from the title as it's unnecessary and confusing. – encore2097 Sep 09 '17 at 23:13

3 Answers

1

You can use the numpy.piecewise function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.piecewise.html

import numpy as np

positive = df.values > 0
negative = df.values < 0

# Cast to float first: np.piecewise allocates its output array with the
# input's dtype, so an integer input would silently truncate the log values.
df[:] = np.piecewise(df.values.astype(float), (positive, negative),
                     (np.log, lambda x: -np.log(-x)))
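A quick sanity check on a toy frame (column name assumed). Entries matching neither condition, i.e. zeros, are left at 0, because np.piecewise fills unmatched positions with 0 by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, -9, 0]}, dtype=float)  # float, so log results survive
positive = df.values > 0
negative = df.values < 0
df[:] = np.piecewise(df.values, (positive, negative),
                     (np.log, lambda x: -np.log(-x)))
print(df["a"].tolist())  # 1 -> log(1) = 0, -9 -> -log(9), 0 stays 0
```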
AGN Gazer
  • Seems like it should work but I'm getting this error: `ValueError: cannot copy sequence with size 54572193 to array axis with dimension 10` – encore2097 Sep 10 '17 at 00:18
  • `negative = df.values < 0` (the logical_not of `>` is `<=`, which includes 0). This seems to work, though I have to convert back to a df. Thanks! Here's the code I tried: `positive = df.values > 0; negative = df.values < 0; df = np.piecewise(df.values, [positive, negative], [lambda x: np.log(x), lambda x: -np.log(-x)])` – encore2097 Sep 10 '17 at 00:39
  • @encore2097 Sorry about inclusion of 0s into negatives. I fixed this in the latest edit. I also added how to assign values of the new `numpy` array back to your `dataframe`. – AGN Gazer Sep 10 '17 at 00:49
0

There's probably a better way but for now here's what I did:

Broke down my columns into three types:

  1. (-inf,0)
  2. (0, inf)
  3. (-inf, inf)

The first two are straightforward [1]:

for i in ['a', ...]:
    s = df[i]
    df[i] = np.where(s < 0, -np.log(-s), s).astype('float32')

Similar code for type 2.

Type 3 was trickier and slower:

def apply_log(x):
    if x>0:
        return np.log(x)
    elif x<0:
        return -np.log(-x)
    elif x == 0:
        return 0.0
    else:
        assert False

Then vectorize it [2]

veclog = np.vectorize(apply_log)

Then run it: df['c'] = veclog(df['c'].astype('float32')).astype('float32')

Runtime on ~50M subset: 57.7 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Per [3], with np.where the function is applied to the full array before the condition selects values, hence the divide-by-zero warning. Types 1 and 2 throw the divide-by-zero warnings; Type 3 raises no warnings.
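If the warnings are the only concern, one option (not part of the approach above, just a sketch) is to keep the np.where one-liner for the mixed-sign Type 3 columns and silence the spurious warnings with np.errstate, since the values np.log produces for the wrong-sign entries are discarded by the mask anyway:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c": [0, 17, -9]})
s = df["c"].astype("float32")

# np.where evaluates both branches over the full array before selecting,
# so np.log(s) also touches entries where s <= 0; errstate suppresses
# the resulting divide-by-zero / invalid-value warnings.
with np.errstate(divide="ignore", invalid="ignore"):
    df["c"] = np.where(s > 0, np.log(s),
                       np.where(s < 0, -np.log(-s), s)).astype("float32")
```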

Sources:

[1] Python: numpy/pandas change values on condition

[2] Function application over numpy's matrix row/column

[3] RuntimeWarning: divide by zero encountered in log

encore2097
0

Someone added this answer before and then deleted it:

df = np.log(df.where(df>0)).fillna(-1*np.log(-1*df.where(df<0))).fillna(0)

Which uses a chunk of memory but appears to work. I have a suspicion it was deleted because it runs the ops in sequence and might clobber some values.

Update: This seems to match the results of the other solutions to within 0.0005. I'd appreciate it if the original poster would re-post their answer!
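For reference, a self-contained run of that one-liner (toy frame, column name assumed). Because df.where masks the non-matching entries to NaN before np.log sees them, and log of a quiet NaN propagates silently, this version should raise none of the RuntimeWarnings discussed above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, -9, 0, 83]})

# positives -> log(x); negatives -> -log(-x); remaining NaNs (zeros) -> 0
out = np.log(df.where(df > 0)).fillna(-1 * np.log(-1 * df.where(df < 0))).fillna(0)
```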

encore2097