As @jpp answered in "Performance of Pandas apply vs np.vectorize to create new column from existing columns":
I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays.[1] The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.[2]

[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.

[2] There are at least two reasons why NumPy operations are efficient versus Python: everything in Python is an object (this includes, unlike C, numbers), so Python types have an overhead which does not exist with native C types; and NumPy methods are usually C-based, with optimised algorithms used where possible.
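To make the quoted point concrete, here is a minimal sketch (with a hypothetical column name "c2") contrasting a vectorised column operation with the equivalent Python-level loop; both produce the same result, but only the first stays in optimised C code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c2": np.arange(100_000)})

# Vectorised: a single NumPy operation over a contiguous memory block.
d3_vec = df["c2"] + 1

# Python-level loop: every element is boxed into a Python object and
# handled one at a time — this is what vectorisation avoids.
d3_loop = pd.Series([x + 1 for x in df["c2"]], dtype=d3_vec.dtype)

assert d3_vec.equals(d3_loop)
```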
Now I am confused, because I thought df.apply was vectorised. To me, that meant all rows are processed in parallel at the same time.
I wrote some simple code, and the execution times show that df.apply() runs just as slowly as df.iterrows():
#!/usr/bin/env python3
# coding=utf-8
"""
File for testing parallel processing for pandas groupby, apply and cudf group apply
"""
import random
import sys
import time as t

import cudf as cf
import pandas as pd

amount_rows = 100


def f3(row):
    row['d3'] = row['c2'] + 1
    t.sleep(0.05)
    return row


if __name__ == "__main__":
    print(sys.version_info)
    print("Pandas: ", pd.__version__)
    print("CUDF: ", cf.__version__)

    # Creating test data as dict
    l_key = []
    for _ in range(amount_rows):
        l_key.append(random.randint(0, 9))
    d = {'c1': l_key, 'c2': list(range(amount_rows))}

    # Creating Pandas DF from dict
    df = pd.DataFrame(d)

    # Check if apply executes in parallel
    t9 = t.time()
    df3 = df.apply(f3, axis=1)
    t10 = t.time()
    diff4 = t10 - t9
    print(f"df.apply( f3, axis=1 ) for {amount_rows} takes {round(diff4, 8)} seconds")

    # ITERROWS
    aa = t.time()
    for key, row in df.iterrows():
        row['d3'] = row['c2'] + 1
        t.sleep(0.05)
    bb = t.time()
    diff5 = bb - aa
    print(f"df.iterrows( ) for {amount_rows} takes {round(diff5, 8)} seconds")
And the result of the executed code is:
sys.version_info(major=3, minor=8, micro=0, releaselevel='final', serial=0)
Pandas: 1.5.3
CUDF: 22.12.0
df.apply( f3, axis=1 ) for 100 takes 5.05231261 seconds
df.iterrows( ) for 100 takes 5.04581475 seconds
I expected the execution time for df.apply to be well under 1 s: with a 0.05 s sleep per row, 100 rows executed in parallel should take roughly 0.05 s in total, not 100 × 0.05 s = 5 s. But it looks like df.apply executes row by row, not all rows at the same time.
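The sequential behaviour can be observed directly by recording every invocation of the applied function. A minimal sketch (with assumed data and a hypothetical function f):

```python
import pandas as pd

calls = []

def f(row):
    # record which row the function is currently seeing
    calls.append(int(row["c2"]))
    return row["c2"] + 1

df = pd.DataFrame({"c2": range(5)})
out = df.apply(f, axis=1)

# f ran on the rows one at a time, in row order — a plain Python loop,
# not a parallel dispatch.
assert calls == sorted(calls)
assert set(calls) == set(range(5))
```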
Can someone help me understand what is wrong with this code?