I am an R programmer currently trying to learn Python / pandas. I am struggling with how to clearly and cleanly create a new variable from a function that uses multiple existing variables.
Note that the function in my example isn't that complex, but I am trying to generalise to an arbitrary function that could be significantly more complex or require more variables. That is to say, I want to avoid solutions that are optimised for this specific function and am looking for how to handle the general scenario.
For reference, this is an example of how I would do this in R:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble(
    num = c(15, 52, 24, 29),
    cls = c("a", "b", "b", "a")
)

attempt1 <- function(num, cls) {
    if (cls == "a") return(num + 10)
    if (cls == "b") return(num - 10)
}

## Example 1
df %>%
    mutate(num2 = map2_dbl(num, cls, attempt1))

## Example 2
df %>%
    mutate(num = ifelse(num <= 25, num + 10, num)) %>%
    mutate(num2 = map2_dbl(num, cls, attempt1))
Reading the pandas documentation as well as various SO posts, I have found multiple ways of achieving this in Python; however, none of them sit well with me. For reference, I've posted my current 3 solutions below:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"]
})

### Example 1

# Attempt 1: a function of plain values, applied row by row through a lambda
def attempt1(num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

df.assign(num2=df.apply(lambda x: attempt1(x["num"], x["cls"]), axis=1))

# Attempt 2: the function receives each row (passed in as `df`) and indexes into it
def attempt2(df):
    if df["cls"] == "a":
        return df["num"] + 10
    if df["cls"] == "b":
        return df["num"] - 10

df.assign(num2=df.apply(attempt2, axis=1))

# Attempt 3: a wrapper that adds num2 to each row by calling attempt1
def attempt3(df):
    df["num2"] = attempt1(df["num"], df["cls"])
    return df

df.apply(attempt3, axis=1)

### Example 2
df.assign(num=np.where(df["num"] <= 25, df["num"] + 10, df["num"]))\
    .apply(attempt3, axis=1)
My issue with attempt 1 is that it is quite verbose. In addition, you need to refer back to your starting dataset, which means that if you wanted to chain multiple derivations together you would have to write your dataset out to intermediate variables even if you had no intention of keeping them.
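As far as I can tell, `DataFrame.assign` also accepts a callable, which is evaluated against the dataframe at that point in the chain. A minimal sketch of how that might remove the need for an intermediate variable in Example 2 (the name `result` is only for illustration, and the lambdas are still verbose):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"]
})

def attempt1(num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

# Each lambda receives the dataframe as it exists at that step of the chain,
# so the second assign sees the updated "num" without an intermediate variable.
result = (
    df
    .assign(num=lambda d: np.where(d["num"] <= 25, d["num"] + 10, d["num"]))
    .assign(num2=lambda d: d.apply(lambda x: attempt1(x["num"], x["cls"]), axis=1))
)
```

Here `df` itself is left untouched, since `assign` returns a new dataframe.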
Attempt 2 has significantly cleaner syntax but still suffers from the intermediate variable problem. Another issue is that the function expects a dataframe row rather than plain values, which makes it harder to unit test, less flexible, and less clear about what the inputs should be.
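To make the testing concern concrete, here is a rough sketch of what a check for each style might look like. The parameter name `row` is my own renaming for clarity, and the test values are arbitrary:

```python
import pandas as pd

def attempt1(num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

def attempt2(row):
    if row["cls"] == "a":
        return row["num"] + 10
    if row["cls"] == "b":
        return row["num"] - 10

# The scalar version can be checked with plain values,
# and the signature documents which inputs it needs.
assert attempt1(14, "a") == 24
assert attempt1(52, "b") == 42

# The row version needs a row-like object built first, and the required
# keys are only discoverable by reading the function body.
row = pd.Series({"num": 14, "cls": "a"})
assert attempt2(row) == 24
```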
Attempt 3 seems the best to me in terms of functionality, as it provides a clear, testable function and doesn't require saving intermediate datasets. The major downside is that you now need two functions, which feels like redundant code.
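One idea for avoiding a wrapper per function is a single reusable helper, loosely playing the role of `map2_dbl`. This is only a sketch: `rowwise` is a hypothetical helper I made up, not part of pandas, and it assumes the function takes one positional argument per named column.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"]
})

def attempt1(num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

def rowwise(func, *cols):
    """Hypothetical helper: build a callable for DataFrame.assign that
    calls func once per row with the values of the named columns."""
    def _apply(d):
        # zip the selected columns so func receives plain values, not a row
        return [func(*args) for args in zip(*(d[c] for c in cols))]
    return _apply

result = df.assign(num2=rowwise(attempt1, "num", "cls"))
```

With this, `attempt1` stays a plain function of values, and the same helper could be reused for any other function and set of columns.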
Any help or advice would be greatly appreciated.