List comprehensions are much faster than an ordinary for loop. The usual explanation is that a list comprehension avoids the repeated append call, which makes sense. I have also read in several places, and seen for myself, that list comprehensions are faster than apply. What I don't understand is which internal mechanism makes them so much faster than apply.
I know this has something to do with vectorization in NumPy, which underlies pandas DataFrames. But it is not clear to me why a list comprehension would beat apply: in a list comprehension we write an explicit for loop inside the list, whereas with apply we write no loop at all (and I assume vectorization takes place there too).
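One way to test the vectorization assumption is to count how often the function passed to apply is actually invoked. If apply were vectorized in the NumPy sense, the function would receive the whole array once rather than one element at a time. The sketch below uses a small made-up Series; the counter `calls` and the helper `double_if_multiple_of_5` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical counter used only to observe how often the function runs.
calls = {"n": 0}

def double_if_multiple_of_5(x):
    # Each invocation receives whatever apply hands over: a scalar or an array.
    calls["n"] += 1
    return x * 2 if x % 5 == 0 else x

s = pd.Series(range(5, 15))  # small made-up data, not the Titanic set

result = s.apply(double_if_multiple_of_5)
apply_calls = calls["n"]

# The same logic written as a list comprehension, for comparison.
comprehension = [x * 2 if x % 5 == 0 else x for x in s]

print("function calls made by apply:", apply_calls)
print("elements in the Series:", len(s))
```

If the call count matches the number of elements, apply is invoking the Python function per element, so both versions would be paying for a Python-level loop and the difference would lie in the overhead around it.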
Edit: adding code. This uses the Titanic dataset, where a title is extracted from each name: https://www.kaggle.com/c/titanic/data
%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
('Master' if 'Master' in x else 'None'))))
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]
Result:
apply: 782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
list comprehension: 499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit2: While putting together a simple example for SO, I found that, surprisingly, the results reverse for the code below:
import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range(0, 5000000):
    tlist.append(i)
    tlist2.append(i + 5)
df_test['A'] = tlist
df_test['B'] = tlist2
display(df_test.head(5))
%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]
display(df_test.head(5))
Result:
apply: 1 loop, best of 3: 2.14 s per loop
list comprehension: 1 loop, best of 3: 2.24 s per loop
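For reference, a version of the same transformation that is vectorized in the NumPy sense might look like the sketch below. It is not part of the timings above; `df_small` is a made-up frame, and `np.where` is swapped in here as one possible way to express the condition on whole arrays at once.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for df_test.
df_small = pd.DataFrame({"B": np.arange(5, 25)})
b = df_small["B"]

# Whole-array operations: no Python function is called per element.
vectorized = np.where(b % 5 == 0, b * 2, b)

# Element-by-element version, as in the list comprehension above.
looped = [x * 2 if x % 5 == 0 else x for x in b]

df_small["C"] = vectorized
print(df_small.head(5))
```

Timing this form against apply and the list comprehension on the same data would show how far both are from an array-level operation.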
Edit3: Some have suggested that apply is essentially a for loop. That does not seem to be the case: when I run this code with an explicit for loop, it almost never ends. I had to stop it manually after 3-4 minutes, and it had not completed by then:
for row in df_test.itertuples():
    x = row.B
    if x % 5 == 0:
        df_test.at[row.Index, 'B'] = x * 2
Running the above code takes around 23 seconds, whereas apply takes only 1.8 seconds. So what is the difference between this explicit loop over itertuples and apply?
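To narrow down where the time goes, the loop can be split into its two costs: iterating with itertuples, and writing back one cell at a time with `.at`. The sketch below compares three variants on a small made-up frame; the function names are hypothetical, and the assumption that the per-row writes may be a large share of the cost is something to verify with the timings, not a given.

```python
import timeit
import pandas as pd

def update_with_at(df):
    # Explicit loop that writes each changed cell back individually.
    for row in df.itertuples():
        x = row.B
        if x % 5 == 0:
            df.at[row.Index, "B"] = x * 2
    return df

def update_with_list(df):
    # Same iteration, but values are collected and assigned in one step.
    df["B"] = [row.B * 2 if row.B % 5 == 0 else row.B for row in df.itertuples()]
    return df

def update_with_apply(df):
    df["B"] = df["B"].apply(lambda x: x * 2 if x % 5 == 0 else x)
    return df

# Small made-up frame; increase the size to make the differences visible.
small = pd.DataFrame({"A": range(1000), "B": range(5, 1005)})
variants = (update_with_at, update_with_list, update_with_apply)

# All three variants should produce the same column.
results = {f.__name__: f(small.copy())["B"].tolist() for f in variants}

for f in variants:
    seconds = timeit.timeit(lambda f=f: f(small.copy()), number=3)
    print(f.__name__, round(seconds, 4), "s for 3 runs")
```

If `update_with_list` lands close to `update_with_apply` while `update_with_at` is far slower, that would suggest the gap comes from the per-cell writes rather than from looping as such.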