Why is list comprehension faster than apply in pandas

Question

Using list comprehensions is much faster than a normal for loop. The reason usually given is that a list comprehension does not need to call append, which is understandable. But I have read in various places that list comprehensions are also faster than apply, and I have seen that myself. What I cannot work out is the internal mechanism that makes them so much faster than apply.

I know this has something to do with vectorization in numpy, which is the base implementation of pandas dataframes. But why a list comprehension should beat apply is not clear to me: in a list comprehension we write the for loop ourselves inside the list, whereas with apply we do not write any for loop at all (and I assume vectorization takes place there as well).

Edit: adding code. This runs on the Titanic dataset, where the title is extracted from the name: https://www.kaggle.com/c/titanic/data

%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
                                         ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
                                                ('Master' if 'Master' in x else 'None'))))

%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]

Result: 782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit2: To add code for SO, I wrote a simpler example, and surprisingly, for the code below, the results are reversed:

import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range (0,5000000):
  tlist.append(i)
  tlist2.append(i+5)
df_test['A'] = tlist
df_test['B'] = tlist2

display(df_test.head(5))


%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]

display(df_test.head(5))

1 loop, best of 3: 2.14 s per loop

1 loop, best of 3: 2.24 s per loop

Edit3: Some suggested that apply is essentially a for loop. That does not seem to be the case: when I ran this code with a for loop, it almost never ended. I had to stop it manually after 3-4 minutes and it had not completed in that time:

for row in df_test.itertuples():
  x = row.B
  if x%5==0:
    df_test.at[row.Index,'B'] = x*2

Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between this explicit loop over itertuples and apply?

Here's an interesting and [related SO question and answer](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care) that I have bookmarked — G. Anderson, Aug 12 '19 at 21:25
@G.Anderson , thanks for the link but it says apply is slower but not why — Tushar Seth, Aug 12 '19 at 21:35
`.apply` is basically a for-loop. It does not use vectorization. And note, list comprehensions are only marginally faster than for loops, and both can be made essentially equally performant if you cache the `.append` method resolution, which is practically what a list comprehension does (note it still uses append) — juanpa.arrivillaga, Aug 12 '19 at 22:53
@TusharSeth it does, you just need to look for it. It is essentially a slow wrapper around a for loop with a lot of overhead which usually isn't required for most simple operations. — cs95, Aug 15 '19 at 07:32
I am referring to this SO post: https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code It says never to use apply because it is much slower, but it does not answer why; that's why I asked this separate question — Tushar Seth, Aug 19 '19 at 09:19
@juanpa.arrivillaga , added in question as apply is not for loop as for loop is way much slower than apply, so it has to be something else — Tushar Seth, Aug 28 '19 at 09:13
@QuangHoang as explained to juanpa, added in question as apply is not for loop as for loop is way much slower than apply, so it has to be something else — Tushar Seth, Aug 28 '19 at 09:18
@TusharSeth because the loop you are using is the *slowest possible way*. **Never** use `x = df_test.loc[i,'B']`, try it with `df.itertuples()`. It **is a loop**. You can [check the source code yourself](https://stackoverflow.com/questions/38938318/why-apply-sometimes-isnt-faster-than-for-loop-in-pandas-dataframe/38938507#38938507) — juanpa.arrivillaga, Aug 28 '19 at 16:28
@juanpa.arrivillaga +1 for that link. But I have a doubt: `for row in df_test.itertuples(): x = row.B if x%5==0: print(row.B)` This code using itertuples is also very slow. Apologies if I am missing something, but I really need to understand how the loop inside apply can be faster than this explicit for loop — Tushar Seth, Aug 28 '19 at 19:06
@juanpa.arrivillaga, I updated the itertuples code. It takes around 23 seconds but apply finishes in just 1 second, so my doubt is what the difference in their implementations would be — Tushar Seth, Aug 29 '19 at 09:56
I assumed apply was faster because pandas Series' are just numpy arrays, and numpy runs on C. — imad97, Nov 24 '22 at 03:48

Alex Bochkarev · Answer 1 · 2023-01-06T17:00:21.020

There are a few reasons for the performance difference between apply and list comprehension.

First of all, the list comprehension in your code doesn't make a function call on each iteration, while apply does. This makes a measurable difference:

map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
                 ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                 ('Master' if 'Master' in x else 'None')))

%timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
# 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
                 ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                 ('Master' if 'Master' in x else 'None'))) for x in train['Name']]
# 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
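
The same comparison can be reproduced without the Titanic data. The following is a minimal sketch, assuming a made-up list `names` as a stand-in for `train['Name']` and a hypothetical helper `title_of` that mirrors `map_function`; the absolute timings will differ from the ones above and depend on the machine.

```python
import timeit

# Hypothetical stand-in for train['Name']; this is not the Titanic data.
names = ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina",
         "Palsson, Master. Gosta", "Unknown"] * 2000


def title_of(name):
    """Same branching as map_function, written as a named function."""
    if "Mrs" in name:
        return "Mrs."
    if "Mr" in name:
        return "Mr"
    if "Miss" in name:
        return "Miss"
    if "Master" in name:
        return "Master"
    return "None"


def with_call():
    # One Python-level function call per element.
    return [title_of(x) for x in names]


def inlined():
    # Same logic written inside the comprehension: no per-element call.
    return ["Mrs." if "Mrs" in x else
            ("Mr" if "Mr" in x else
             ("Miss" if "Miss" in x else
              ("Master" if "Master" in x else "None")))
            for x in names]


# Both variants must agree before their timings are worth comparing.
assert with_call() == inlined()

t_call = timeit.timeit(with_call, number=20)
t_inline = timeit.timeit(inlined, number=20)
print(f"with call: {t_call:.3f}s, inlined: {t_inline:.3f}s")
```

The only difference between the two functions is the extra call to `title_of` per element, so any gap in the printed timings can be attributed to call overhead.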

Secondly, apply does much more than a list comprehension. For example, it tries to find an appropriate dtype for the result. By disabling that behaviour you can see what impact it has:

%timeit train['NameTitle'] = train['Name'].apply(map_function)
# 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
# 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
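
To see what the dtype step produces, here is a small sketch on made-up data (the Series `s` is hypothetical, not part of the Titanic set). It only illustrates the difference in the result type; it does not measure the cost of the inference.

```python
import pandas as pd

# Made-up data: numbers stored as strings.
s = pd.Series(["1", "2", "3"])

# apply returns a Series and picks a dtype for the converted values.
applied = s.apply(int)

# A list comprehension just collects Python objects; nothing is inferred.
plain = [int(x) for x in s]

print(applied.dtype)  # an integer dtype chosen by pandas
print(type(plain))    # a plain Python list
```

The list comprehension skips the inference entirely, which is part of why it has less work to do per call.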

There is also other work happening inside apply, so in this example you would want to use map:

%timeit train['NameTitle'] = train['Name'].map(map_function)
# 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This performs better than the list comprehension with a function call in it.

Then why use apply at all, you might ask? I know at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function (ufunc). That's because apply, unlike map and a list comprehension, lets the function run on the whole Series instead of on the individual objects in it. Let's see an example:

%timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
# 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].apply(np.exp)
# 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].map(np.exp)
# 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
# 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
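
The difference between the first two lines can be made visible by counting calls. This is a sketch on made-up values (the Series `ages` and the helper `exp_wrapped` are hypothetical): a Python wrapper around `np.exp` is invoked element by element, while passing `np.exp` itself gives the same result as calling it on the whole Series.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for train['Age'].
ages = pd.Series([22.0, 38.0, 26.0, 35.0])

calls = {"n": 0}


def exp_wrapped(x):
    # Python wrapper: apply cannot tell this is a ufunc underneath,
    # so it is called on individual elements.
    calls["n"] += 1
    return np.exp(x)


via_wrapper = ages.apply(exp_wrapped)  # element-by-element path
via_ufunc = ages.apply(np.exp)         # ufunc passed directly
direct = np.exp(ages)                  # ufunc called on the whole Series

print(calls["n"])                       # number of Python-level calls made
print(np.allclose(via_ufunc, direct))
```

All three results hold the same values; only the number of Python-level calls differs, which is where the timing gap between `apply(lambda x: np.exp(x))` and `apply(np.exp)` comes from.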
