I have a pandas dataframe
. I want to generate a new variable (column) based on multiple column inputs where the year index is greater than a certain value.
The below illustrates what I want to do, but I want to simplify it into a function as the calculation in reality is more complex than the below illustration and the variable names are longer.
Ideally the function would split the calculation into intermediate temporary values (not saved to the df
) and across multiple lines to make it easier to read. For example, could define: Share = (df['B']+df['C']) / (df['B']+df['C']+df['D'])
and then X = A + Share * E
.
I have used apply
previously to apply a function to a dataframe
, but that example only used a single variable as input and no where clause, and I do not know how to extend the example.
How would I simply the below calculation to generate X
based on A
, B
, C
, D
, and E
, where year
>= 2020?
import numpy as np
import pandas as pd
np.random.seed(2981)
df = pd.DataFrame({
'year' : [2018, 2019, 2020, 2021,2018, 2019, 2020, 2021,2018, 2019, 2020, 2021],
'id' : ['ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','GHI','GHI','GHI','GHI'],
'A': np.random.choice(range(100),12),
'B': np.random.choice(range(100),12),
'C': np.random.choice(range(100),12),
'D': np.random.choice(range(100),12),
'E': np.random.choice(range(100),12),
})
df = df.set_index('year')
df['X'] = np.where( df.index >= 2020, df['A'] + (df['B']+df['C']) / (df['B']+df['C']+df['D']) * df['E'] , np.nan )