1

I want to check a python panda performance

col = [
        "age",
        "workclass",
        "fnlwgt",
        "education", 
        "educationNum", 
        "maritalStatus",
        "occupation",
        "relationship",
        "race",
        "sex",
        "capitalGain",
        "capitalLoss",
        "hoursPerWeek",
        "nativeCountry", 
        "Above50K" 
]
import pandas as pd
df = pd.read_csv("adult.data", header=None)
df.columns = col

Query 1:
df[(df.education == " HS-grad") & (df.sex == " Female") & (df.age <25)]

Query 2 :
df.loc[(df.education == " HS-grad") & (df.sex == " Female") & (df.age <25), :]

Query 1 vs Query 2 ; which is better performance wise

I'mahdi
  • 23,382
  • 5
  • 22
  • 30
  • Have you researched how to check the performance of a code snippet? – Chris Jun 12 '22 at 15:25
  • 1
    Does this answer your question? [How to measure the speed of a python function](https://stackoverflow.com/questions/44677606/how-to-measure-the-speed-of-a-python-function) – Chris Jun 12 '22 at 15:26

1 Answers1

0

I recommend you pandas.query but I benchmark your question with 1 million rows on colab:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age':np.random.randint(0,100,1_000_000),
    'sex': np.random.choice(['Female','Male'],1_000_000),
    'education': np.random.choice(['BS','MS','HS'],1_000_000),
    'workclass': np.random.choice(['A','B','C'],1_000_000),
    'fnlwgt': np.random.choice(['A','B','C'],1_000_000),
    'maritalStatus': np.random.choice(['A','B','C'],1_000_000),
    'race': np.random.choice(['A','B','C'],1_000_000),
    'hoursPerWeek': np.random.choice(['A','B','C'],1_000_000),
    'nativeCountry': np.random.choice(['A','B','C'],1_000_000),
    'capitalGain': np.random.choice(['A','B','C'],1_000_000),
    'capitalLoss': np.random.choice(['A','B','C'],1_000_000),
})


%timeit df[(df.education == "HS") & (df.sex == "Female") & (df.age <25)]
%timeit df.query('(education=="HS") & (sex=="Female") & (age < 25)')
%timeit df.loc[(df.education == "HS") & (df.sex == "Female") & (df.age <25), :]

Output:

10 loops, best of 5: 138 ms per loop
10 loops, best of 5: 82.4 ms per loop
10 loops, best of 5: 139 ms per loop

Check the result of two approaches: (No problem exists)

from pandas.util.testing import assert_frame_equal
res1 = df[(df.education == "HS") & (df.sex == "Female") & (df.age <25)]
res2 = df.query('(education=="HS") & (sex=="Female") & (age < 25)')
assert_frame_equal(res1, res2)

We can check these three queries with different rows of pandas like below:

(We can see with_query better than others and with_loc better than with_out_loc)

enter image description here

Code of above plot and benchmark:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import time

bench = []
for num_row in np.power(np.arange(10,1500,5),2):

    df = pd.DataFrame({
        'age':np.random.randint(0,100,num_row),
        'sex': np.random.choice(['Female','Male'],num_row),
        'education': np.random.choice(['BS','MS','HS'],num_row),
    })

    start = time.time()
    df[(df.education == "HS") & (df.sex == "Female") & (df.age <25)]
    bench.append({'Num_Rows':num_row, 'Method':'with_out_loc', 'Time':time.time() - start})

    start = time.time()
    df.loc[(df.education == "HS") & (df.sex == "Female") & (df.age <25), :]
    bench.append({'Num_Rows':num_row, 'Method':'with_loc', 'Time':time.time() - start})

    start = time.time()
    df.query('(education=="HS") & (sex=="Female") & (age < 25)')
    bench.append({'Num_Rows':num_row, 'Method':'with_query', 'Time':time.time() - start})


plt.subplots(1,1, figsize=(10,7))
df = pd.DataFrame(bench)
sns.lineplot(data=df, x="Num_Rows", y="Time", hue="Method", style="Method")
plt.show()
I'mahdi
  • 23,382
  • 5
  • 22
  • 30