I have a pandas DataFrame:
A | B | C |
---|---|---|
AA | BB | CC |
AAA | BBB | CCC |
Now I have a function that takes columns B and C and returns a Series based on their values.
I have tried something like this:

    import pandas as pd

    def do_magic(a: pd.Series, b: pd.Series) -> pd.Series:
        def magic(aa, bb):
            return <SOME MAGIC SPELL>

        # Apply magic to each pair of values from the two columns.
        magic_spells = []
        for aaa, bbb in zip(a, b):
            magic_spells.append(magic(aaa, bbb))
        return pd.Series(magic_spells)

This works fine, but I am wondering whether I can improve its performance with something similar to the pandas DataFrame apply method:

    def do_magic(dataframe):
        def magic(aa, bb):
            return <SOME MAGIC SPELL>

        # Pass the current row's values, not the whole columns.
        return dataframe.apply(lambda row: magic(row['B'], row['C']), axis=1)

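The second function is more performant than the first, but I can't pass the entire DataFrame to the function: I am actually trying to build a PySpark pandas_udf, so, as in the first version, the function has to take the columns as separate pd.Series arguments and return a pd.Series.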
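To make the constraint concrete, here is a minimal runnable sketch of the shape I am after. It keeps the Series-in, Series-out signature but rebuilds a two-column frame inside the function so that apply can be used. The body of `magic` is a hypothetical stand-in (simple string joining), not my real logic. I have left out the pandas_udf decorator so that it runs with pandas alone, and I have not measured whether this is actually faster than the loop.

```python
import pandas as pd


def magic(bb: str, cc: str) -> str:
    # Hypothetical stand-in for the real per-row logic.
    return f"{bb}-{cc}"


def do_magic(b: pd.Series, c: pd.Series) -> pd.Series:
    # Rebuild a two-column frame from the Series passed in,
    # then apply magic row by row.
    frame = pd.concat([b.rename("B"), c.rename("C")], axis=1)
    return frame.apply(lambda row: magic(row["B"], row["C"]), axis=1)


# Sample data matching the table above.
df = pd.DataFrame({"A": ["AA", "AAA"], "B": ["BB", "BBB"], "C": ["CC", "CCC"]})
result = do_magic(df["B"], df["C"])
print(result)
```

In the real code, `do_magic` would be the function wrapped by pandas_udf, with an appropriate return type.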
Any help will be appreciated.