I want to compare the performance of two different methods for filtering pandas DataFrames. I created a test set of n
points in the plane and filter out every point that is not inside the square -1 < x < 1, -1 < y < 1. I am surprised that one method is so much faster than the other, and the larger n
becomes, the bigger the difference. What explains this?
This is my script:
import numpy as np
import time
import pandas as pd
# Test set with points
n = 100000
test_x_points = np.random.uniform(-10, 10, size=n)
test_y_points = np.random.uniform(-10, 10, size=n)
# list() keeps the points reusable; on Python 3 a bare zip() is a one-shot iterator
test_points = list(zip(test_x_points, test_y_points))
df = pd.DataFrame(test_points, columns=['x', 'y'])
# Method a
start_time = time.time()
result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)
# Method b
start_time = time.time()
result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)
# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))
Results for different values of n:
For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.
For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.
For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.
For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.
For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.
For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.
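Each number above comes from a single `time.time()` measurement, so the small-n rows may be dominated by noise and one-off overhead. A minimal sketch of a more repeatable measurement with `timeit`, assuming Python 3; the helper names `make_points`, `method_a`, `method_b` and `best_ms` are only illustrative and not part of the script above:

```python
import timeit

import numpy as np
import pandas as pd


def make_points(n, seed=0):
    """Build a DataFrame of n random points in [-10, 10) x [-10, 10)."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame({'x': rng.uniform(-10, 10, size=n),
                         'y': rng.uniform(-10, 10, size=n)})


def method_a(df):
    # Boolean mask built from whole-column comparisons.
    return df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]


def method_b(df):
    # Python-level lambda called once per row.
    return df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]


def best_ms(func, df, repeat=3, number=1):
    """Best wall-clock time per call, in milliseconds, over `repeat` runs."""
    timings = timeit.repeat(lambda: func(df), repeat=repeat, number=number)
    return 1000 * min(timings) / number


if __name__ == '__main__':
    df = make_points(10000)
    print('Method a: {0:.3f} ms'.format(best_ms(method_a, df)))
    print('Method b: {0:.3f} ms'.format(best_ms(method_b, df)))
```

Taking the minimum over several runs is one common convention for this kind of micro-benchmark; it does not change which method is faster, only how stable the reported numbers are.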
When I compare it to a native Python list comprehension, method a is still much faster:
result_c = [ (x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1 ]
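For completeness, here is a sketch of how the list comprehension could be timed against method a on the same data while checking that both select the same points. It assumes Python 3, and the names `filter_vectorized`, `filter_comprehension` and `timed_ms` are hypothetical helpers, not part of my script:

```python
import time

import numpy as np
import pandas as pd


def filter_vectorized(df):
    # Method a: one boolean mask computed over entire columns.
    mask = (df['x'] > -1) & (df['x'] < 1) & (df['y'] > -1) & (df['y'] < 1)
    return df[mask]


def filter_comprehension(points):
    # Method c: plain Python loop over (x, y) tuples.
    return [(x, y) for (x, y) in points if -1 < x < 1 and -1 < y < 1]


def timed_ms(func, arg):
    """Call func(arg) once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(arg)
    return result, 1000 * (time.perf_counter() - start)


if __name__ == '__main__':
    n = 100000
    xs = np.random.uniform(-10, 10, size=n)
    ys = np.random.uniform(-10, 10, size=n)
    points = list(zip(xs, ys))  # materialized so it can be reused
    df = pd.DataFrame(points, columns=['x', 'y'])

    result_a, ms_a = timed_ms(filter_vectorized, df)
    result_c, ms_c = timed_ms(filter_comprehension, points)

    print('Method a: {0:.3f} ms, {1} elements'.format(ms_a, len(result_a)))
    print('Method c: {0:.3f} ms, {1} elements'.format(ms_c, len(result_c)))
```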
Why is that?