
I have two pieces of code that seem to do the same thing, but one is almost a thousand times faster than the other.

This is the first piece:

t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1) 

In ts I have values like:

0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344

In contrast, this part of the code:

t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)

populates ts with values like:

0.11008906364440918, 0.09556794166564941, 0.08580684661865234

I cannot figure out what the essential difference is between the first and second assignments.

In both cases df should be the same.

ADDED

It turned out that the essential difference was not in the place where I was looking. In the fast version of the code I had:

df = inp_df.copy()

at the beginning of the class method (where inp_df was the method's input data frame). In the slow version, I was operating directly on the input data frame; the code became fast after copying the input data frame and operating on the copy.
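For illustration, a minimal sketch of the pattern described above (the method name fill_bands and the self.bands container are hypothetical; col, new_col, i1, i2 and val are the names used in the question):

def fill_bands(self, inp_df):                       # hypothetical method name
    df = inp_df.copy()                              # the fix: operate on a private copy
    for i1, i2, val in self.bands:                  # hypothetical (lower, upper, value) triples
        df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df['new_col'])
    return df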

– Roman
Try to pre-compute the where condition and only time the call to np.where and the assignment to df[new_col]. What do you see? – BlackBear Dec 05 '18 at 13:16
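Following that suggestion would look something like this (a sketch using the variables from the question):

cond = (df[col] >= i1) & (df[col] < i2)            # pre-compute the condition, outside the timed block

t1 = time.time()
df['new_col'] = np.where(cond, val, df.new_col)    # time only np.where plus the assignment
t2 = time.time()
ts.append(t2 - t1)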

2 Answers


The first time you use only one condition, so it should be faster than when you check two conditions. A simple example using IPython:

In [3]: %timeit 1 < 2                                                                                                                                         
20.4 ns ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit (1 >= 0) & (1 < 2)
37 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
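The same contrast holds for pandas Series, where these operators work elementwise; note that the parentheses are required because & binds tighter than the comparisons. A small illustrative sketch (the data values are made up):

import pandas as pd

s = pd.Series([0.1, 0.5, 0.9])
one = s < 0.5                       # a single vectorised comparison
two = (s >= 0.25) & (s < 0.75)      # two comparisons plus an elementwise AND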
– Brown Bear

Assignment is not the bottleneck

Assigning values to Pandas series is cheap, especially if you are assigning via regular objects such as pd.Series, np.ndarray or list.

Broadcasting is even cheaper

Note broadcasting is extremely cheap, i.e. when you are setting scalar values such as val_1 and val_2 in the first example.

Your second example has a series assignment for the case where your condition is not met. This is relatively expensive.

Calculations are relatively expensive

On the other hand, the calculations you perform are relatively expensive.

In the first example, you have one calculation:

df[col] < j

In the second example, you have at least three calculations:

a = df[col] >= i1
b = df[col] < i2
a & b

Therefore, you can and should expect the second version to be more expensive.

Use timeit

It's good practice to use the timeit module for reliable performance timings. The reproducible example below shows a smaller performance differential than what you claim:

import pandas as pd, numpy as np

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})

j = 0.5
i1, i2 = 0.25, 0.75

%timeit np.where(df['A'] < j, 1, 2)                             # 85.5 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])  # 161 ms per loop

One calculation is cheaper than three calculations:

%timeit df['A'] < j                                             # 14.8 ms per loop
%timeit (df['A'] >= i1) & (df['A'] < i2)                        # 65.6 ms per loop

Broadcasting via scalar values is cheaper than assigning series:

%timeit np.where(df['A'] < j, 1, df['A'])                       # 113 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, 2)        # 146 ms per loop
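The %timeit magic above requires IPython; outside it, the standard-library timeit module gives the same kind of measurement. A minimal sketch (number=10 is an arbitrary choice, and absolute numbers will differ by machine):

import timeit

setup = """
import pandas as pd, numpy as np
np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})
j, i1, i2 = 0.5, 0.25, 0.75
"""

# total seconds for 10 runs of each statement
print(timeit.timeit("np.where(df['A'] < j, 1, 2)", setup=setup, number=10))
print(timeit.timeit("np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])",
                    setup=setup, number=10))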
– jpp