This call to apply is row-wise iteration:
df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)
apply is essentially syntactic sugar for a Python-level loop, and because you passed axis=1 it iterates row by row.
Your other snippet
df["log2FC"] = np.where(df["C1Mean"]==0,
np.log2(df["C2Mean"]),
np.log2(df["C2Mean"]/df["C1Mean"]))
acts on entire columns at once, so it's vectorised.
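If you want to see the difference yourself, here is a minimal sketch that runs both versions on the same frame. The data is made up (random positive values, so the zero branch is never hit); only the column names come from your question, and the helper names row_wise and vectorised are mine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "C1Mean": rng.uniform(1.0, 100.0, size=10_000),
    "C2Mean": rng.uniform(1.0, 100.0, size=10_000),
})


def row_wise(frame):
    # One Python-level lambda call per row.
    return frame.apply(
        lambda x: np.log2(x["C2Mean"] / x["C1Mean"]) if x["C1Mean"] > 0 else np.log2(x["C2Mean"]),
        axis=1,
    )


def vectorised(frame):
    # Whole-column operations; the looping happens inside compiled code.
    return np.where(
        frame["C1Mean"] == 0,
        np.log2(frame["C2Mean"]),
        np.log2(frame["C2Mean"] / frame["C1Mean"]),
    )


# Both approaches should agree on this data.
same = np.allclose(row_wise(df).to_numpy(), vectorised(df))

# Rough single-run timings; exact numbers depend on the machine.
t_apply = timeit.timeit(lambda: row_wise(df), number=1)
t_where = timeit.timeit(lambda: vectorised(df), number=1)
print(f"same result: {same}, apply: {t_apply:.4f}s, np.where: {t_where:.4f}s")
```

Note that np.where evaluates both branches over the full columns before selecting, so with real zeros in C1Mean you may see divide-by-zero warnings even though the final values are taken from the other branch.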
The other thing is that pandas performs more checking, index alignment, etc. than numpy.
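As a small illustration of what index alignment means, here is a sketch with two made-up Series that share labels in a different order. pandas matches values by label before adding, whereas the underlying numpy arrays are added purely by position.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with the same labels in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "x"])

# pandas matches values by label before adding.
by_label = a + b

# numpy adds by position, with no label lookup at all.
by_position = a.to_numpy() + b.to_numpy()

print(by_label.to_dict())
print(by_position.tolist())
```

That label matching is extra work numpy never does, which is part of why the same arithmetic costs more when it goes through pandas objects.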
Your calls to np.log2 gain you nothing in this context, as you pass scalar values:
np.log2(x["C2Mean"]/x["C1Mean"])
Performance-wise, this is no better than calling math.log2.
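A quick way to check this on your own machine is the sketch below. The value 8.0 is arbitrary, and the timings are only indicative, so I have not quoted any numbers.

```python
import math
import timeit

import numpy as np

value = 8.0  # arbitrary scalar, chosen only for illustration

# On a scalar, both functions compute the same thing.
np_result = float(np.log2(value))
math_result = math.log2(value)

# Rough per-call timings; exact numbers depend on the machine.
t_np = timeit.timeit(lambda: np.log2(value), number=100_000)
t_math = timeit.timeit(lambda: math.log2(value), number=100_000)

print(f"np.log2: {np_result}, math.log2: {math_result}")
print(f"np.log2: {t_np:.4f}s, math.log2: {t_math:.4f}s for 100,000 scalar calls")
```

The benefit of np.log2 only shows up when you hand it a whole array in one call.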
Explaining why numpy is significantly faster, or what vectorisation is, is beyond the scope of this question. See: What is vectorization?
The essential point is that numpy can and will hand the work to compiled routines written in C or Fortran, which run much faster than the equivalent loop in interpreted Python.