
I am having a performance issue when using iterrows(). My code looks something like this:

for _, row in df.iterrows():
    row["new_col"] = \
        df.apply(lambda x:some_func(row["col1"], ...), axis=1)

some_func() is a fairly complicated function. It cannot take a Series or a DataFrame as input; it requires specific values taken from the same row.

However, as the number of rows increases, the processing time grows exponentially, not linearly.

Are there any suggestions on how to speed it up? Perhaps splitting the data into smaller groups would help, or using something other than iterrows().

Any comment is appreciated.

EDIT 1.

for count, row in df.iterrows():
    df.loc[count, "new_col"] = some_func(row["col1"], ... )
loamoza
  • 3
    It depends on the `complicated function`. What is the reason for looping with `iterrows` and then looping again in `apply`? – jezrael Jun 07 '22 at 06:05
  • 1
    Does your code really need to loop over all rows of the original DataFrame for each row, via `df.apply(lambda x:...`? – jezrael Jun 07 '22 at 06:07
  • 8
    Should just be `df['new_col'] = df.apply(lambda x: some_func(x['col1'], ...), axis=1)`. – Quang Hoang Jun 07 '22 at 06:12
  • @jezrael The reason I am using `iterrows` is that I have to iterate over each cell of a row while using input from the same row, and I cannot use vectors due to a limitation of the function. I am thinking of rewriting my function, but that might be very complicated. Thanks for your reply! – loamoza Jun 07 '22 at 06:13
  • 1
    @loamoza - the comment means: remove `for _, row in df.iterrows():` and use only `df['new_col'] = df.apply(lambda x: some_func(x['col1'], ...), axis=1)` – jezrael Jun 07 '22 at 06:17
  • 1
    @loamoza no, you don't need `iterrows` here. Notice the `df['new_col']` in my comment. `apply` helps you align all the cells. – Quang Hoang Jun 07 '22 at 06:18
  • you might be interested in [this question](https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues). – Quang Hoang Jun 07 '22 at 06:20
  • @QuangHoang You are right! I have just fixed the problem. I am adding my solution in the next edit. – loamoza Jun 07 '22 at 06:32
  • You are still using a `for` loop. The `apply` itself is a `for` loop in a way, so either use `apply` alone or use a `for` loop alone. Your problem can be solved by Quang Hoang's comment. You edited your question but still included the `for` loop. Remove it; the solution in the comment is sufficient. – Onyambu Jun 07 '22 at 07:03
  • And as @jezrael pointed, it depends on `some_func`. Often complex functions can be simplified to a composition of trivial functions. If not, you can try to [numba](https://numba.pydata.org/)-it. – Zaero Divide Jun 07 '22 at 11:13
  • And BTW, just to be pedantic, _"if data cannot be vectorized"_ is wrong. Data is a vector/matrix already. What cannot be vectorized is the function. – Zaero Divide Jun 07 '22 at 11:18

1 Answer


For testing purposes, assume that your function is:

def some_func(c1, c2, c3, c4, c5):
    return c2 + c4 + 100

I defined the test DataFrame as:

   col1  col2  col3  col4  col5
0     1     2     3     4     5
1     6     7     8     9    10
2    11    12    13    14    15

If your function can receive neither a DataFrame nor a Series, you can write an "adapter function":

def myAdapter(row):
    return some_func(*row[:5])

I added [:5] to allow for a DataFrame that has some "additional" columns. In that case some_func is still called with the values from the first 5 columns of the source row.

And to generate the new column in df, run:

df['new_col'] = df.apply(myAdapter, raw=True, axis=1)

raw=True causes the argument passed to myAdapter (the current row) to be passed as a NumPy array instead of a pandas Series.

Its elements can be referred to by integer indices, and when calling some_func it is possible to unpack it with the * operator (consecutive elements of row are passed as consecutive parameters of some_func).
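To illustrate the point about raw=True, here is a minimal sketch that records what kind of object the applied function receives in each mode. The two-column DataFrame and the helper names (record_default, record_raw, seen) are made up for this illustration only.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, only used to inspect the argument type.
df = pd.DataFrame({"col1": [1, 6], "col2": [2, 7]})

seen = {"default_is_series": [], "raw_is_ndarray": []}

def record_default(row):
    # Without raw=True each row arrives as a pandas Series.
    seen["default_is_series"].append(isinstance(row, pd.Series))
    return 0

def record_raw(row):
    # With raw=True each row arrives as a NumPy array,
    # so it supports integer indexing and * unpacking.
    seen["raw_is_ndarray"].append(isinstance(row, np.ndarray))
    return 0

df.apply(record_default, axis=1)
df.apply(record_raw, axis=1, raw=True)

print(seen)
```

Skipping the construction of a Series for every row is the usual reason raw=True is cheaper, although how much it helps depends on the data and the pandas version.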

For my test data, the result is:

   col1  col2  col3  col4  col5  new_col
0     1     2     3     4     5      106
1     6     7     8     9    10      116
2    11    12    13    14    15      126
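For convenience, the pieces above can be put together into one runnable script. This is only a sketch that rebuilds the test DataFrame by hand; some_func is the stand-in function from this answer, not your real one.

```python
import pandas as pd

def some_func(c1, c2, c3, c4, c5):
    # Stand-in for the real, more complicated function.
    return c2 + c4 + 100

def myAdapter(row):
    # row is a NumPy array because apply is called with raw=True;
    # only the first 5 values are forwarded to some_func.
    return some_func(*row[:5])

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# One call to some_func per row, no outer iterrows loop.
df["new_col"] = df.apply(myAdapter, raw=True, axis=1)
print(df)
```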

This way, at the very least, some_func is called only once per row, so the execution time should no longer blow up as the number of rows grows. In my opinion, it should also execute faster than your code.
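If you want to check this on your own machine, a rough timing sketch could look like the one below. It compares the row-by-row loop from EDIT 1 with the apply call, using the stand-in some_func from this answer. The helpers make_df, with_loop and with_apply and the row counts are assumptions chosen for illustration, and actual timings will vary with hardware, pandas version and the cost of your real function.

```python
import time
import pandas as pd

def some_func(c1, c2, c3, c4, c5):
    # Stand-in for the real function.
    return c2 + c4 + 100

def myAdapter(row):
    return some_func(*row[:5])

def make_df(n_rows):
    # Hypothetical helper: same layout as the test data, but n_rows long.
    data = [[5 * i + j for j in range(1, 6)] for i in range(n_rows)]
    return pd.DataFrame(data, columns=["col1", "col2", "col3", "col4", "col5"])

def with_loop(df):
    # Variant from EDIT 1: one df.loc assignment per row.
    out = df.copy()
    out["new_col"] = 0  # assumption: pre-create the column before filling it
    for count, row in df.iterrows():
        out.loc[count, "new_col"] = some_func(
            row["col1"], row["col2"], row["col3"], row["col4"], row["col5"]
        )
    return out["new_col"]

def with_apply(df):
    # Variant from this answer: a single apply over the rows.
    return df.apply(myAdapter, raw=True, axis=1)

for n_rows in (500, 1000, 2000):
    df = make_df(n_rows)

    start = time.perf_counter()
    loop_result = with_loop(df)
    loop_time = time.perf_counter() - start

    start = time.perf_counter()
    apply_result = with_apply(df)
    apply_time = time.perf_counter() - start

    print(f"{n_rows} rows: loop {loop_time:.4f}s, apply {apply_time:.4f}s")
```

Both variants call some_func once per row, so they should produce the same values; the difference is in the per-row overhead around that call.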

Valdi_Bo