
When creating a new column in a pandas DataFrame based on some condition, NumPy's where function outperforms the apply method in execution time. Why is that so?

For example:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))
ashish14
  • `apply` is syntactic sugar for looping row-wise. In your other snippet it's acting on the entire columns – EdChum May 13 '19 at 09:55

1 Answer


This call to apply is row-wise iteration:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

apply is just syntactic sugar for a Python-level loop; because you passed axis=1, it calls your lambda once per row.

Your other snippet

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))

is acting on the entire columns, so it's vectorised.
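
To make the difference concrete, here is a minimal sketch that runs both versions on the same data, checks that they agree, and times them. The sample DataFrame, its size, and the helper names with_apply and with_where are assumptions made for illustration; only the column names come from the question. Note that np.where evaluates both branches for every row, so the division is also computed for rows where C1Mean is 0 and then discarded.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "C1Mean": rng.integers(0, 5, size=2_000).astype(float),  # includes zeros
    "C2Mean": rng.uniform(1.0, 100.0, size=2_000),
})


def with_apply(frame):
    # One Python-level function call per row.
    return frame.apply(
        lambda x: np.log2(x["C2Mean"] / x["C1Mean"]) if x["C1Mean"] > 0
        else np.log2(x["C2Mean"]),
        axis=1,
    )


def with_where(frame):
    # Both branches are computed for whole columns; errstate guards against
    # any divide-by-zero warning for rows where C1Mean == 0.
    with np.errstate(divide="ignore"):
        return np.where(
            frame["C1Mean"] == 0,
            np.log2(frame["C2Mean"]),
            np.log2(frame["C2Mean"] / frame["C1Mean"]),
        )


same = np.allclose(with_apply(df).to_numpy(), with_where(df))
t_apply = timeit.timeit(lambda: with_apply(df), number=3)
t_where = timeit.timeit(lambda: with_where(df), number=3)
print(f"same result: {same}")
print(f"apply: {t_apply:.4f}s  where: {t_where:.4f}s")
```

The exact timings depend on the machine and on the pandas and NumPy versions, so no figures are quoted here.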

The other thing is that pandas performs more checking, index alignment, etc. than NumPy.
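
As a small illustration of the index alignment mentioned above (the two Series and their values are made up for the example), pandas matches operands on their index labels before adding, while NumPy simply adds position by position:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same labels, stored in opposite order.
a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

aligned = a + b                           # pandas pairs values by index label
positional = a.to_numpy() + b.to_numpy()  # NumPy pairs values by position

print(aligned.loc[0])  # 31.0  (1.0 from a, 30.0 from b, both labelled 0)
print(positional[0])   # 11.0  (1.0 + 10.0, the first element of each array)
```

That label matching is useful, but it is extra work pandas does on each operation that plain NumPy arrays skip.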

Your calls to np.log2 gain nothing from NumPy in this context, because you pass scalar values:

 np.log2(x["C2Mean"]/x["C1Mean"])

Performance-wise, this is no better than calling math.log2.
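
A quick way to check this yourself is sketched below. The input value and the repetition count are arbitrary choices for the example; the point is only that on a single scalar, np.log2 has no array to vectorise over.

```python
import math
import timeit

import numpy as np

value = 8.0  # arbitrary scalar chosen for the example

# Both functions give the same answer for a single number.
result_math = math.log2(value)
result_numpy = float(np.log2(value))
print(result_math, result_numpy)

# Time many scalar calls of each; results vary by machine and version.
t_math = timeit.timeit(lambda: math.log2(value), number=100_000)
t_numpy = timeit.timeit(lambda: np.log2(value), number=100_000)
print(f"math.log2: {t_math:.4f}s  np.log2: {t_numpy:.4f}s")
```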

Explaining why NumPy is significantly faster, or what vectorisation is, is beyond the scope of this question. See the linked question "What is vectorization?".

The essential point is that NumPy runs these operations in compiled code, including external libraries written in C or Fortran, which is much faster than looping in Python.

EdChum
  • Thanks, EdChum. Can you also explain how numpy makes vectorized operations faster? Is it because of parallelization using threading or multiprocessing? – ashish14 May 13 '19 at 10:01
  • If we use broadcast as the result type, will this help to vectorize the apply? It doesn't make sense; apply should be smart enough. – prosti May 13 '19 at 10:04
  • @prosti that won't make any difference, that param is for determining what shape will be returned, here as it's row-wise it's irrelevant – EdChum May 13 '19 at 10:06
  • @ashish14 see update, it's beyond the scope of this answer – EdChum May 13 '19 at 10:07