
I want to compare the performance of two different methods for filtering pandas DataFrames. I created a test set of n points in the plane and keep only the points inside the square -1 < x < 1, -1 < y < 1. I am surprised that one method is so much faster than the other, and the larger n becomes, the bigger the difference gets. What would explain that?

This is my script:

import numpy as np
import time
import pandas as pd


# Test set with points
n              = 100000
test_x_points  = np.random.uniform(-10, 10, size=n)
test_y_points  = np.random.uniform(-10, 10, size=n)
test_points    = list(zip(test_x_points, test_y_points))  # materialize: a bare zip iterator would be exhausted after building df
df             = pd.DataFrame(test_points, columns=['x', 'y'])


# Method a
start_time     = time.time()
result_a       = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time       = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)


# Method b
start_time     = time.time()
result_b       = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time       = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)


# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))

Results for different values of n:

For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.

For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.

For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.

For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.

For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.

For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.

When I compare it to a native Python list comprehension, method a is still much faster:

result_c = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
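
Timed the same way as methods a and b (a sketch; it relies on test_points being the materialized list created above, so it can be iterated again):

start_time     = time.time()
result_c       = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
end_time       = time.time()
elapsed_time_c = 1000 * abs(end_time - start_time)
print('Method c took {0} ms and leaves us with {1} elements.'.format(elapsed_time_c, len(result_c)))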

Why is that?

Elmex80s
  • `df.apply(..., axis=1)` is a `for loop` under the hood - i.e. it's NOT vectorized. – MaxU - stand with Ukraine Jan 18 '18 at 10:19
  • @MaxU Vectorized on the CPU or on the GPU? How big can each vector be? – Elmex80s Jan 18 '18 at 10:21
  • "vectorised" == parallel/threaded processing. – cs95 Jan 18 '18 at 10:26
  • @Coldspeed Thanks, but every core has a vector processing unit, no? Is that used as well? – Elmex80s Jan 18 '18 at 10:36
  • One of them is coded in C, the other in pure Python, so it carries a lot of overhead. See the NumPy section of Jake VanderPlas' book https://github.com/jakevdp/PythonDataScienceHandbook/blob/599aa0fe3f882c0001670e676e5a8d43b92c35fc/notebooks/02.00-Introduction-to-NumPy.ipynb (it continues with a couple of subsections). – ayhan Jan 18 '18 at 10:56
  • Shouldn't some of these comments be answers? E.g. @ayhan, your comment basically answers the question. – LangeHaare Jan 18 '18 at 11:59
  • @LangeHaare There are several discussions on Stack Overflow about why numpy/pandas operations are fast; what I would write here would basically repeat those. If I close this as a duplicate, on the other hand, people will complain about differences in detail, so I just left a comment. – ayhan Jan 18 '18 at 12:06
  • @ayhan Could you then mark it as a duplicate if we already have answers elsewhere? – stucash Jan 18 '18 at 12:42

1 Answer


If you follow the Pandas source code for apply, you will see that it generally ends up doing a Python `for ... in ...` loop.
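
Roughly (a simplified sketch of the idea, not the actual Pandas internals), df.apply(func, axis=1) behaves like this Python-level loop:

import pandas as pd

def apply_rows(df, func):
    # Simplified model of df.apply(func, axis=1): materialize each row
    # as a Series and call the Python function once per row -- all of
    # this runs in the interpreter, nothing is vectorized.
    results = []
    for idx in df.index:
        results.append(func(df.loc[idx]))
    return pd.Series(results, index=df.index)

# Roughly equivalent to method b:
# mask = apply_rows(df, lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1)

Each iteration pays for building a row Series plus a Python function call, which is why method b's cost grows with n at interpreter speed.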

However, Pandas DataFrames are made up of Pandas Series, which are backed by NumPy arrays under the hood. Masked filtering uses the fast, vectorized operations that NumPy arrays allow. For details on why this is faster than plain Python loops (as in .apply), see Why are NumPy arrays so fast?
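
Concretely, the boolean mask in method a boils down to element-wise NumPy operations on the columns' underlying arrays (a sketch of the equivalence, not the exact code path Pandas takes):

x = df['x'].values                               # the underlying numpy array
y = df['y'].values
mask = (x > -1) & (x < 1) & (y > -1) & (y < 1)   # a few C-level passes, no Python loop
result = df[mask]                                # selects the same rows as method a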

The top answer from that question:

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
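
A minimal demonstration of that gap, using only NumPy and the standard library (exact timings depend on your machine):

import numpy as np
import timeit

x = np.random.uniform(-10, 10, size=100000)

def vectorized():
    # One C-level pass per comparison over a contiguous float64 buffer.
    return (x > -1) & (x < 1)

def interpreted():
    # One interpreted comparison (plus object boxing) per element.
    return [-1 < v < 1 for v in x]

print(timeit.timeit(vectorized, number=100))
print(timeit.timeit(interpreted, number=100))  # typically orders of magnitude slower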

LangeHaare