Why is numpy's where operation faster than the apply function?

Question

When creating a new column in a pandas DataFrame based on a condition, numpy's where function outperforms the apply method in execution time. Why is that?

For example:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))

`apply` is syntactic sugar for looping row-wise. In your other snippet it's acting on the entire columns — EdChum, May 13 '19 at 09:55

EdChum · Accepted Answer · 2019-05-13T10:05:11.680


This call to apply is row-wise iteration:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

apply is just syntactic sugar for a loop; because you passed axis=1, it iterates row by row and calls your lambda on each row.

Your other snippet

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))

is acting on the entire columns, so it's vectorised.
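
The difference between the two snippets can be made visible with a small sketch. The sample data and the helper name row_log2fc are made up for illustration; the column names come from the question. The counter shows that apply calls into Python for every row, while np.where runs once over whole columns and gives the same values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "C1Mean": [0.0, 2.0, 4.0, 0.0],
    "C2Mean": [8.0, 8.0, 1.0, 2.0],
})

calls = {"n": 0}  # counts how often the row function is invoked

def row_log2fc(row):
    """Same logic as the lambda in the question, plus a call counter."""
    calls["n"] += 1
    if row["C1Mean"] > 0:
        return np.log2(row["C2Mean"] / row["C1Mean"])
    return np.log2(row["C2Mean"])

# Row-wise: one Python-level function call per row.
via_apply = df.apply(row_log2fc, axis=1)

# Column-wise: each expression is evaluated once over whole columns.
via_where = np.where(
    df["C1Mean"] == 0,
    np.log2(df["C2Mean"]),
    np.log2(df["C2Mean"] / df["C1Mean"]),
)

print("Python-level calls made by apply:", calls["n"])
print("same result:", np.allclose(via_apply.to_numpy(), via_where))
```

Note that np.where evaluates both branches for every row and only then selects, so the division by zero still happens in the unused branch (it yields inf, which is discarded).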

The other thing is that pandas performs more checking, index alignment, and so on than numpy does.
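
As a rough illustration of the index alignment mentioned above (the two Series here are invented for the example): pandas matches values by label before operating on them, which is extra work that plain numpy arrays skip by operating purely by position.

```python
import pandas as pd

# Two hypothetical Series holding the same labels in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "x"])

# pandas aligns on the index labels first, then adds.
aligned = a + b

# The underlying numpy arrays are added position by position.
positional = a.to_numpy() + b.to_numpy()

print(aligned["x"], aligned["y"], aligned["z"])
print(positional)
```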

Your calls to np.log2 gain nothing in this context, as you pass scalar values:

 np.log2(x["C2Mean"]/x["C1Mean"])

Performance-wise, this is no better than calling math.log2.
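
The scalar case can be checked with a quick timing sketch. The value 8.0 and the repetition count are arbitrary choices, and the timings depend on the machine and library versions, so treat the printed numbers as indicative only:

```python
import math
import timeit

import numpy as np

x = 8.0  # arbitrary scalar input

py_result = math.log2(x)
np_result = np.log2(x)  # returns a numpy scalar (np.float64)

# Time many scalar calls of each function.
math_time = timeit.timeit(lambda: math.log2(x), number=100_000)
numpy_time = timeit.timeit(lambda: np.log2(x), number=100_000)

print(f"math.log2: {math_time:.4f}s, np.log2: {numpy_time:.4f}s")
```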

Explaining why numpy is significantly faster, or what vectorisation is, is beyond the scope of this question. See: What is vectorization?

The essential thing here is that numpy can and will use compiled code and external libraries written in C or Fortran, which are much faster than looping in Python.
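
A minimal sketch of that difference, assuming an array of 100,000 random positive values (the size, seed, and function names are arbitrary): the first function loops in the Python interpreter, the second hands the whole array to numpy in a single call. Actual timings vary by machine.

```python
import math
import timeit

import numpy as np

# Hypothetical input: 100,000 positive floats.
values = np.random.default_rng(0).uniform(1.0, 100.0, size=100_000)

def python_loop():
    # One interpreter-level iteration and function call per element.
    return [math.log2(v) for v in values]

def numpy_call():
    # A single call; the per-element loop runs inside numpy's compiled code.
    return np.log2(values)

loop_time = timeit.timeit(python_loop, number=5)
numpy_time = timeit.timeit(numpy_call, number=5)

print(f"Python loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```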

edited May 13 '19 at 10:05

answered May 13 '19 at 09:57

EdChum


Thanks, EdChum. Can you also explain how numpy makes vectorized operations faster? Is it because of parallelization using threading or multiprocessing? – ashish14 May 13 '19 at 10:01
If we use broadcast as the result type, will this help to vectorize the apply? It doesn't make sense; apply should be smart enough. – prosti May 13 '19 at 10:04
@prosti that won't make any difference. That param determines what shape is returned; here, as it's row-wise, it's irrelevant. – EdChum May 13 '19 at 10:06
@ashish14 see update, it's beyond the scope of this answer – EdChum May 13 '19 at 10:07
