What makes the apply method in Pandas so inefficient?

Question

I'm trying to optimize some Python code that uses the Pandas library to process about 1 GB of CSV data.
I noticed that the apply method in Pandas seems to be much slower than plain Python functions.

Specifically, code that uses DataFrame.apply runs about 20 times slower.

Here is a reproducible example comparing DataFrame.apply with the built-in map function:

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=list('ABCD'))

# Row-wise function used by both approaches
f_test = lambda x: x[0] - x[1] + x[2] - x[3]

# Approach 1: DataFrame.apply along rows
before = datetime.datetime.now()
tmp1 = df.apply(f_test, axis=1)
after = datetime.datetime.now()
print(after - before)

# Approach 2: convert to a list of lists and use the built-in map
before = datetime.datetime.now()
tmp2 = df.values.tolist()
tmp3 = pd.Series(list(map(f_test, tmp2)))
after = datetime.datetime.now()
print(after - before)
print("checking equality", tmp1.equals(tmp3))

The output:

0:00:01.249394
0:00:00.064491
checking equality True

While the outputs are the same, the second approach runs much faster.
Why is the Pandas apply method so slow in this particular example?

AFAIK apply() doesn't use vectorization, and that's why it's slow on big data — eshirvana, Jan 23 '22 at 00:24
@eshirvana What do you mean by vectorization? Is my second code example vectorized or not? — fshabashev, Jan 23 '22 at 01:26
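
To make the term in the comments concrete, here is a minimal sketch of what "vectorized" usually means in this context: the arithmetic is expressed as operations on whole columns, so the per-row work happens in compiled NumPy code instead of the Python interpreter. The seeded generator and the names vectorized, row_func and looped are assumptions added for illustration and are not part of the original question. By this definition the map-based approach from the question is not vectorized either, since it still calls a Python function once per row; it only avoids building a Series for every row.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the fixed seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100_000, 4)), columns=list("ABCD"))

# Vectorized: three column-wise operations, each executed over the whole column at once.
vectorized = df["A"] - df["B"] + df["C"] - df["D"]

# Not vectorized: a Python-level call per row, as in the question's second approach.
row_func = lambda x: x[0] - x[1] + x[2] - x[3]
looped = pd.Series(list(map(row_func, df.values.tolist())))

# Both approaches should produce the same values.
print("same values:", bool((vectorized.to_numpy() == looped.to_numpy()).all()))
```

Timing the two expressions with the same datetime pattern used in the question would show how they compare on a given machine; no figures are claimed here.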
