1

I have a pandas dataframe. I want to generate a new variable (column) based on multiple column inputs where the year index is greater than a certain value.

The below illustrates what I want to do, but I want to simplify it into a function as the calculation in reality is more complex than the below illustration and the variable names are longer.

Ideally the function would split the calculation into intermediate temporary values (not saved to the df) and across multiple lines to make it easier to read. For example, could define: Share = (df['B']+df['C']) / (df['B']+df['C']+df['D']) and then X = A + Share * E.

I have used apply previously to apply a function to a dataframe, but that example only used a single variable as input and no where clause, and I do not know how to extend the example.

How would I simply the below calculation to generate X based on A, B, C, D, and E, where year >= 2020?

import numpy as np
import pandas as pd

np.random.seed(2981)

df = pd.DataFrame({
    'year' : [2018, 2019, 2020, 2021,2018, 2019, 2020, 2021,2018, 2019, 2020, 2021],
    'id'   : ['ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','GHI','GHI','GHI','GHI'],
    'A': np.random.choice(range(100),12),
    'B': np.random.choice(range(100),12),
    'C': np.random.choice(range(100),12),
    'D': np.random.choice(range(100),12),
    'E': np.random.choice(range(100),12),
})
df = df.set_index('year')

df['X'] = np.where( df.index >= 2020,  df['A'] + (df['B']+df['C']) / (df['B']+df['C']+df['D']) * df['E'] , np.nan )
brb
  • 1,123
  • 17
  • 40
  • `np.where` looks fine, I don't understand what you want to simplify. – Ynjxsjmh May 06 '22 at 16:49
  • 1
    You should use vectorized methods wherever possible and what you already have looks fine for the job you describe. Depending on the number of rows, `apply` might get much, much slower. If you describe your actual function, we might be able to help you with writing it in vectorized form. –  May 06 '22 at 17:00
  • @enke. Thank you. The calculation above is kind of my actual function. I want to create a new variable X, based on the above calculation. Let's say the calculation is A + Intermediate1 * Intermediate2 where Intermediate1 is a function of 5 variables on the df and Intermediate2 is a function of 2 variables on the df. I was trying to avoid generating temporary variables Intermediate1 and Intermediate2 as they are not needed. But maybe that is the best way to do it? – brb May 06 '22 at 17:04
  • @brb you don’t need to assign it anywhere. You can calculate the two intermediate variables, keep them in memory and assign whenever you need. –  May 06 '22 at 17:10

1 Answers1

1

First, you should only use apply if necessary. Vectorized functions will be much faster, and the way you have it written now in the np.where statement makes use of these. If you really want to make your code more readable (at the (probably small) expense of time and memory) you could make an intermediate column and then use it in the np.where statement.

df["Share"] = ( df.B + df.C ) / ( df.B + df.C + df.D )
df["X"] = ( df.A + df.Share * df.E ).where( df.index >= 2020 )

To answer your question, however, you can create a custom function and then apply it to your DataFrame.

def my_func( year,a,b,c,d,e ):
    #This function can be longer and do more things
    return np.nan if year < 2020 else a + ( ( (b + c) / (b + c + d) ) * e )


df['X'] = df.apply( lambda x: my_func( x.name, x.A, x.B, x.C, x.D, x.E ), axis = 1 )

Note that to access then index of a row when using apply with axis = 1 you need to use the name attribute.

Also, since applying a function is relatively slow, it may be worth creating columns that take care of some of the intermediate steps (such as summing several columns, etc.) so that that doesn't need to be done in each iteration.

Check out this answer for more examples of applying a custom function.

Leo
  • 446
  • 2
  • 9
  • Thank you. So, for a large dataset, you would recommend creating new columns of intermediate values and then dropping them later? Would this be faster / more efficient? – brb May 06 '22 at 17:00
  • 1
    In general, you should avoid using apply (which will loop over the rows) if you don't need it (vectorized functions are MUCH faster). So, if you can replace your whole expression with built-in, vectorized functions that's ideal. If you do need to use apply, then I would try to make the function you are applying as light as possible, and making use of the vectorized functions for intermediate steps like df.B+df.C. – Leo May 06 '22 at 17:04
  • Thanks. So if the dataset is largeish and the calculations reasonable simple, would you recommend I generate intermediate variables say intermediate1 = (b+c)/(b+c+d) etc and do it that way rather than apply a function using apply? I guess what I am asking, is I want to do it best practice, so should I abandon apply, or is there a way to define a function and used it in a vectorised way? – brb May 06 '22 at 17:12
  • 1
    You can look into more ways of vectorizing custom functions (https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c) but in general, abandon apply wherever possible, and make use of pandas and numpy vectorization (via performing operations directly with columns, using np.where, etc.). – Leo May 06 '22 at 17:22
  • Thank you. I will check out this answer to get a better understanding. Appreciate your help and insights. – brb May 06 '22 at 17:23