
As @jpp answered in Performance of Pandas apply vs np.vectorize to create new column from existing columns:

I will start by saying that the power of Pandas and NumPy arrays is derived from high-performance vectorised calculations on numeric arrays [1]. The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks [2].

[1] Numeric types include: int, float, datetime, bool, category. They exclude object dtype and can be held in contiguous memory blocks.

[2] There are at least two reasons why NumPy operations are efficient versus Python: everything in Python is an object, including (unlike C) numbers, so Python types carry an overhead which does not exist with native C types; and NumPy methods are usually C-based, with optimised algorithms used where possible.
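To see what that quoted answer means in practice, here is a minimal sketch of my own (not from the answer) contrasting a Python-level loop with the vectorised form of the same computation:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'c2': np.arange(100_000)})

# Python-level loop: every element is handled as a boxed Python object
start = time.time()
result_loop = [x + 1 for x in df['c2']]
loop_time = time.time() - start

# Vectorised: a single call into optimised C code over contiguous memory
start = time.time()
result_vec = df['c2'] + 1
vec_time = time.time() - start

print(f"loop: {loop_time:.4f} s, vectorised: {vec_time:.4f} s")
```

On my machine the vectorised version is orders of magnitude faster, even though both produce identical values.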

Now I am confused, because I thought df.apply was vectorised. To me that meant all rows are processed in parallel at the same time.

I wrote a simple test, and the execution times show that df.apply() behaves like df.iterrows():

#!/usr/bin/env python3
# coding=utf-8

"""
    File for testing parallel processing for pandas groupby, apply and cudf group apply
"""
import sys
import pandas as pd
import cudf as cf
import random
import time as t

amount_rows = 100


def f3(row):
    row['d3'] = row['c2'] + 1  # simple per-row computation
    t.sleep(0.05)              # simulate expensive work
    return row


if __name__ == "__main__":
    print(sys.version_info)
    print("Pandas: ", pd.__version__)
    print("CUDF: ", cf.__version__)
    # Creating test data as dict
    l_key = []
    for _ in range(amount_rows):
        l_key.append(random.randint(0, 9))
    d = {'c1': l_key, 'c2':  list(range(amount_rows))}

    # Creating Pandas DF from dict
    df = pd.DataFrame(d)

    # Check if apply executes rows in parallel
    t9 = t.time()
    df3 = df.apply(f3, axis=1)
    t10 = t.time()
    diff4 = t10 - t9
    print(f"df.apply( f3, axis=1 ) for  {amount_rows} takes {round(diff4, 8)} seconds")

    # ITERROWS
    aa = t.time()
    for key, row in df.iterrows():
        row['d3'] = row['c2'] + 1
        t.sleep(0.05)
    bb = t.time()
    diff5 = bb - aa
    print(f"df.iterrows( ) for  {amount_rows} takes {round(diff5, 8)} seconds")

And the result of running the code is:

sys.version_info(major=3, minor=8, micro=0, releaselevel='final', serial=0)
Pandas:  1.5.3
CUDF:  22.12.0
df.apply( f3, axis=1 ) for  100 takes 5.05231261 seconds
df.iterrows( ) for  100 takes 5.04581475 seconds

I expected the execution time for df.apply to be under 1 s. But it looks like df.apply executes row by row, not all rows at the same time.
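For comparison, here is what I understand to be the truly vectorised version of the same computation (with the sleep removed); it finishes in well under a millisecond:

```python
import pandas as pd

df = pd.DataFrame({'c2': list(range(100))})

# The vectorised equivalent of f3 (without the sleep): one C-level
# operation over the whole column instead of 100 Python function calls
df['d3'] = df['c2'] + 1

print(df['d3'].head(3).tolist())  # [1, 2, 3]
```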

Can someone help me understand what is wrong with this code?

luki
  • Apply takes each row, passes it as a Series to the function you have, and then executes the function. It isn't vectorised, which here means being passed as a single block to C code for fast execution. – ifly6 Feb 22 '23 at 13:32
  • If you just use `df['d3'] = df['c2']+1` then it takes a small fraction of a millisecond. This 'vectorised' approach carries out the looping behind the scenes in fast, compiled code. Both apply and iterrows are much slower even without your sleep delays. – user19077881 Feb 22 '23 at 15:08
  • Thanks for the answers. I am looking for a fast solution for processing data like this: for each row of df (df is a frame of parameters, meaning df['c2'] is a parameter column) I calculate something on another dataframe. But as @ifly6 said, apply only takes a row and passes it to the function. I will try multiprocessing. – luki Feb 22 '23 at 15:22
  • @user19077881 df['d3'] = df['c2']+1 is only a short example. I suspected row-by-row processing, which is why I wrote this small test program. In another comment I wrote what I am looking for. – luki Feb 22 '23 at 15:29
  • See [link](https://realpython.com/pandas-iterate-over-rows/) and similar references. Note that if looping is unavoidable then `itertuples` is faster and preferable. – user19077881 Feb 22 '23 at 15:33
  • `df.apply` is notorious for being slow. Use of its `raw` mode improves speed, more like iterating through rows of a numpy array - without the pandas row index baggage. We see lots of SO questions where people try to replace `apply` with a `np.where` expression, though they often misunderstand its use. `where` isn't an iterator like `apply`; it takes whole arrays or Series as inputs. – hpaulj Mar 03 '23 at 17:26
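A minimal sketch of the `raw=True` mode hpaulj mentions (my own example, not from the comments): each row arrives as a plain NumPy array rather than a Series, which skips the per-row Series construction, though it still loops row by row in Python:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [10, 20, 30]})

# With raw=True each row is passed as a bare ndarray (positional
# indexing only), avoiding Series construction for every row
out = df.apply(lambda row: row[1] + 1, axis=1, raw=True)
print(out.tolist())  # [11, 21, 31]
```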
