
I have two pieces of code that seem to do the same thing, but one is almost a thousand times faster than the other.

This is the first piece:

t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1) 

In ts I have values like:

0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344

In contrast, this part of the code:

t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)

populates ts with values like:

0.11008906364440918, 0.09556794166564941, 0.08580684661865234

I cannot figure out what the essential difference is between the first and second assignments.

In both cases df should be the same.

ADDED

It turned out that the essential difference was not in the place where I was looking. In the fast version of the code I had:

df = inp_df.copy()

at the beginning of the class method (where inp_df was the method's input data frame). In the slow version, I was operating directly on the input data frame; the code became fast after copying the input data frame and operating on the copy.
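For illustration, a minimal sketch of the pattern described above (the method name fill_bands and the self.bands container are hypothetical; col, new_col, i1, i2 and val are the names used in the question):

def fill_bands(self, inp_df):                       # hypothetical method name
    df = inp_df.copy()                              # the fix: operate on a private copy
    for i1, i2, val in self.bands:                  # hypothetical (lower, upper, value) triples
        df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df['new_col'])
    return df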

– Roman
Try to pre-compute the where condition and only time the call to np.where and the assignment to df[new_col]. What do you see? – BlackBear Dec 05 '18 at 13:16
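Following that suggestion would look something like this (a sketch using the variables from the question):

cond = (df[col] >= i1) & (df[col] < i2)            # pre-compute the condition, outside the timed block

t1 = time.time()
df['new_col'] = np.where(cond, val, df.new_col)    # time only np.where plus the assignment
t2 = time.time()
ts.append(t2 - t1)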

2 Answers


The first time you use only one condition, so it should be faster than when you check two conditions. A simple example using IPython:

In [3]: %timeit 1 < 2                                                                                                                                         
20.4 ns ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit (1 >= 0) & (1 < 2)
37 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
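The same contrast holds for pandas Series, where these operators work elementwise; note that the parentheses are required because & binds tighter than the comparisons. A small illustrative sketch (the data values are made up):

import pandas as pd

s = pd.Series([0.1, 0.5, 0.9])
one = s < 0.5                       # a single vectorised comparison
two = (s >= 0.25) & (s < 0.75)      # two comparisons plus an elementwise AND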
– Brown Bear

Assignment is not the bottleneck

Assigning values to Pandas series is cheap, especially if you are assigning via regular objects such as pd.Series, np.ndarray or list.

Broadcasting is even cheaper

Note broadcasting is extremely cheap, i.e. when you are setting scalar values such as val_1 and val_2 in the first example.

Your second example has a series assignment for the case where your condition is not met. This is relatively expensive.

Calculations are relatively expensive

On the other hand, the calculations you perform are relatively expensive.

In the first example, you have one calculation:

df[col] < j

In the second example, you have at least three calculations:

a = df[col] >= i1
b = df[col] < i2
a & b

Therefore, you can and should expect the second version to be more expensive.

Use timeit

It's good practice to use the timeit module for reliable performance timings. The reproducible example below shows a smaller performance differential than what you claim:

import pandas as pd, numpy as np

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})

j = 0.5
i1, i2 = 0.25, 0.75

%timeit np.where(df['A'] < j, 1, 2)                             # 85.5 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])  # 161 ms per loop

One calculation is cheaper than three calculations:

%timeit df['A'] < j                                             # 14.8 ms per loop
%timeit (df['A'] >= i1) & (df['A'] < i2)                        # 65.6 ms per loop

Broadcasting via scalar values is cheaper than assigning series:

%timeit np.where(df['A'] < j, 1, df['A'])                       # 113 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, 2)        # 146 ms per loop
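The %timeit magic above requires IPython; outside it, the standard-library timeit module gives the same kind of measurement. A minimal sketch (number=10 is an arbitrary choice, and absolute numbers will differ by machine):

import timeit

setup = """
import pandas as pd, numpy as np
np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})
j, i1, i2 = 0.5, 0.25, 0.75
"""

# total seconds for 10 runs of each statement
print(timeit.timeit("np.where(df['A'] < j, 1, 2)", setup=setup, number=10))
print(timeit.timeit("np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])",
                    setup=setup, number=10))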
– jpp