I'm trying to optimize some Python code that uses the pandas library to process about 1 GB of CSV data.
I noticed that the pandas apply method seems to be much slower than native Python functions.

Specifically, code that uses the DataFrame.apply method runs about 20 times slower.

Here is a reproducible example comparing the pandas apply method with native Python functions:

import datetime

import numpy as np
import pandas as pd

# 100,000 rows of random integers in four columns
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=list('ABCD'))

f_test = lambda x: x[0] - x[1] + x[2] - x[3]

# Version 1: row-wise DataFrame.apply
before = datetime.datetime.now()
tmp1 = df.apply(f_test, axis=1)
after = datetime.datetime.now()
print(after - before)

# Version 2: convert to a list of lists and use the built-in map
before = datetime.datetime.now()
tmp2 = df.values.tolist()
tmp3 = pd.Series(list(map(f_test, tmp2)))
after = datetime.datetime.now()
print(after - before)
print("checking equality", tmp1.equals(tmp3))

The output:

0:00:01.249394
0:00:00.064491
checking equality True

While the outputs are the same, the second version runs much faster.
Why is the pandas apply method so slow in this particular example?
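
One way to narrow this down, as a minimal sketch rather than a definitive benchmark: time the same row function through apply with raw=True (which passes each row to the function as a NumPy array instead of a Series), through the map approach, and as a plain column expression. The smaller frame size, the seeded generator, the use of timeit, and the helper names with_apply_raw, with_map and vectorized are assumptions made for illustration; they are not part of the original measurement.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the question so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10_000, 4)), columns=list("ABCD"))


def f_test(x):
    # Same arithmetic as the lambda in the question.
    return x[0] - x[1] + x[2] - x[3]


def with_apply_raw():
    # raw=True hands each row to f_test as a NumPy array, not a Series.
    return df.apply(f_test, axis=1, raw=True)


def with_map():
    # The list-of-lists approach from the question.
    return pd.Series(list(map(f_test, df.values.tolist())))


def vectorized():
    # Column arithmetic, with no Python-level loop over rows.
    return df["A"] - df["B"] + df["C"] - df["D"]


# Report the best of three runs for each variant.
for fn in (with_apply_raw, with_map, vectorized):
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s")
```

If the raw=True variant lands much closer to the map version than the default apply does, that would suggest the per-row Series construction accounts for a large share of the gap, though the actual numbers will depend on the pandas version and the machine.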

fshabashev
