
I want to compare the performance of two different methods for filtering pandas DataFrames. I created a test set of n points in the plane and keep only the points inside the square -1 < x < 1, -1 < y < 1. I am surprised that one method is so much faster than the other, and the larger n becomes, the bigger the difference gets. What would explain that?

This is my script:

import numpy as np
import time
import pandas as pd


# Test set with points
n              = 100000
test_x_points  = np.random.uniform(-10, 10, size=n)
test_y_points  = np.random.uniform(-10, 10, size=n)
test_points    = list(zip(test_x_points, test_y_points))  # materialize: a bare zip iterator would be exhausted after building df
df             = pd.DataFrame(test_points, columns=['x', 'y'])


# Method a
start_time     = time.time()
result_a       = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time       = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)


# Method b
start_time     = time.time()
result_b       = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time       = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)


# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))

Results for different values of n:

For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.

For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.

For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.

For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.

For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.

For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.

When I compare it to a native Python list comprehension, method a is still much faster:

result_c = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
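
Timed the same way as methods a and b (a sketch; it relies on test_points being the materialized list created above, so it can be iterated again):

start_time     = time.time()
result_c       = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
end_time       = time.time()
elapsed_time_c = 1000 * abs(end_time - start_time)
print('Method c took {0} ms and leaves us with {1} elements.'.format(elapsed_time_c, len(result_c)))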

Why is that?

Elmex80s
  • `df.apply(..., axis=1)` is a `for loop` under the hood - i.e. it's NOT vectorized. – MaxU - stand with Ukraine Jan 18 '18 at 10:19
  • @MaxU Vectorized on the CPU or on the GPU? How big can each vector be? – Elmex80s Jan 18 '18 at 10:21
  • "vectorised" == parallel/threaded processing. – cs95 Jan 18 '18 at 10:26
  • @Coldspeed Thanks, but every core has a vector processing unit, no? Is that used as well? – Elmex80s Jan 18 '18 at 10:36
  • One of them is coded in C, the other in pure Python, so it carries a lot of overhead. See the NumPy section of Jake VanderPlas' book https://github.com/jakevdp/PythonDataScienceHandbook/blob/599aa0fe3f882c0001670e676e5a8d43b92c35fc/notebooks/02.00-Introduction-to-NumPy.ipynb (it continues with a couple of subsections). – ayhan Jan 18 '18 at 10:56
  • Shouldn't some of these comments be answers? E.g. @ayhan, your comment basically answers the question. – LangeHaare Jan 18 '18 at 11:59
  • @LangeHaare There are several discussions on Stack Overflow about why numpy/pandas operations are fast; what I would write here would basically repeat those. If I close this as a duplicate, on the other hand, people will complain about differences in detail, so I just left a comment. – ayhan Jan 18 '18 at 12:06
  • @ayhan Could you then mark it as a duplicate if we already have answers elsewhere? – stucash Jan 18 '18 at 12:42

1 Answer


If you follow the Pandas source code for apply, you will see that it generally ends up doing a Python `for ... in ...` loop.
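
Roughly (a simplified sketch of the idea, not the actual Pandas internals), df.apply(func, axis=1) behaves like this Python-level loop:

import pandas as pd

def apply_rows(df, func):
    # Simplified model of df.apply(func, axis=1): materialize each row
    # as a Series and call the Python function once per row -- all of
    # this runs in the interpreter, nothing is vectorized.
    results = []
    for idx in df.index:
        results.append(func(df.loc[idx]))
    return pd.Series(results, index=df.index)

# Roughly equivalent to method b:
# mask = apply_rows(df, lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1)

Each iteration pays for building a row Series plus a Python function call, which is why method b's cost grows with n at interpreter speed.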

However, Pandas DataFrames are made up of Pandas Series, which are backed by NumPy arrays under the hood. Masked filtering uses the fast, vectorized operations that NumPy arrays allow. For details on why this is faster than plain Python loops (as in .apply), see Why are NumPy arrays so fast?
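
Concretely, the boolean mask in method a boils down to element-wise NumPy operations on the columns' underlying arrays (a sketch of the equivalence, not the exact code path Pandas takes):

x = df['x'].values                               # the underlying numpy array
y = df['y'].values
mask = (x > -1) & (x < 1) & (y > -1) & (y < 1)   # a few C-level passes, no Python loop
result = df[mask]                                # selects the same rows as method a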

The top answer from that question:

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
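
A minimal demonstration of that gap, using only NumPy and the standard library (exact timings depend on your machine):

import numpy as np
import timeit

x = np.random.uniform(-10, 10, size=100000)

def vectorized():
    # One C-level pass per comparison over a contiguous float64 buffer.
    return (x > -1) & (x < 1)

def interpreted():
    # One interpreted comparison (plus object boxing) per element.
    return [-1 < v < 1 for v in x]

print(timeit.timeit(vectorized, number=100))
print(timeit.timeit(interpreted, number=100))  # typically orders of magnitude slower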

LangeHaare