
I am an R programmer currently trying to learn Python / pandas. I am trying to work out how to clearly and cleanly create a new variable from a function that uses multiple existing variables.

Note that the function in my example isn't that complex, but I am trying to generalise to an arbitrary function that could be significantly more complex or require more variables. That is, I want to avoid solutions that are optimised for this specific function and am looking for how to handle the general scenario.

For reference this is an example of how I would do this in R.

library(tidyverse)

df <- data_frame(
    num = c(15, 52 , 24 , 29),
    cls = c("a" , "b" , "b", "a")
)

attempt1 <- function( num , cls){
    if ( cls == "a") return( num + 10)
    if ( cls == "b") return( num - 10)
}

## Example 1
df %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

## Example 2
df %>% 
    mutate( num = ifelse( num <= 25 , num + 10 , num)) %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

Reading the pandas documentation as well as various SO posts, I have found multiple ways of achieving this in Python, but none of them sit well with me. For reference, I've posted my current 3 solutions below:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "num" : [14, 52 , 24 , 29],
    "cls" : ["a" , "b" , "b" ,"a"]
})

### Example 1

def attempt1( num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

df.assign( num2 = df.apply( lambda x: attempt1(x["num"] , x["cls"]) , axis = 1))


def attempt2( df):
    if df["cls"] == "a":
        return df["num"] + 10
    if df["cls"] == "b":
        return df["num"] - 10

df.assign( num2 = df.apply(attempt2, axis=1))



def attempt3(df):
    df["num2"] = attempt1(df["num"], df["cls"])
    return df

df.apply( attempt3 , axis = 1)



### Example 2

df.assign( num = np.where( df["num"] <= 25 , df["num"] + 10 , df["num"]))\
    .apply( attempt3 , axis = 1)

My issue with attempt 1 is that it is quite verbose. In addition, you need to refer back to your starting dataset, which means that if you wanted to chain multiple derivations together you would have to save the dataset to intermediate variables even if you had no intention of keeping them.

Attempt 2 has significantly cleaner syntax but still suffers from the intermediate variable problem. Another issue is that the function expects a dataframe, which makes the function harder to unit test, less flexible and less clear about what the inputs should be.

Attempt 3 seems to be the best to me in terms of functionality, as it provides a clear, testable function and doesn't require saving intermediate datasets. The major downside is that you now need 2 functions, which feels like redundant code.

Any help or advice would be greatly appreciated.

gowerc
  • It would help if you define what you mean by "Pythonic". Pythonic (or, in this case, Pandonic / Pandorable) would mean getting rid of `pd.DataFrame.apply` completely. This is true both in general and specific scenarios. – jpp Apr 10 '18 at 10:22

2 Answers


You can rely on Series.where to do the job, by creating a column that contains 10, and changing it to -10 depending on the value of cls. You can then use that column to perform an arithmetic operation like you want.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html

Step by step (verbose) example:

df['what_to_add'] = 10
df['what_to_add'] = df['what_to_add'].where(df['cls'] == 'a', -10)
df['num'] = df['num'] + df['what_to_add']
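The three steps above could also be wrapped in a small function that takes the two columns as inputs, which keeps the derivation testable on its own. This is only a sketch: the name `add_offset` is hypothetical, and it treats every value of `cls` other than "a" as the -10 case, just as the `where` call above does.

```python
import pandas as pd


def add_offset(num: pd.Series, cls: pd.Series) -> pd.Series:
    """Add 10 where cls is "a" and subtract 10 everywhere else."""
    # Start from +10 everywhere, keep it where cls == "a", otherwise use -10.
    offset = pd.Series(10, index=cls.index).where(cls == "a", -10)
    return num + offset


df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

# Passing a callable to assign avoids referring back to df by name.
result = df.assign(num2=lambda d: add_offset(d["num"], d["cls"]))
```

Because `add_offset` only needs two Series, it can be called in a test without building a full dataframe.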

Another possibility given that your two numbers are opposite is to define a column for the sign of the operand:

df['sign'] = 2 * (df['cls'] == 'a').astype(int) - 1  # +1 for "a", -1 otherwise
df['num'] = df['num'] + df['sign'] * 10

A third way to do that is to use replace, so that you replace "a" by 10 and "b" by -10:

df['what_to_add'] = df['cls'].replace(['a', 'b'], [10, -10])
df['num'] = df['num'] + df['what_to_add']

edited: Or, as proposed by JPP (https://stackoverflow.com/a/49748695/4582949), using map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})
Guybrush
  • Hi, Thank you for taking the time to reply. Sorry to be a pain but I was hoping to keep the solution as a function of the core inputs so that it is possible to abstract and unittest the derivation. Although what you've shared works for this example (thank you for helping) I am not sure how it would scale to more complex derivations or those that require multiple inputs. – gowerc Apr 10 '18 at 08:32
  • Could you elaborate? I don't get it :) – Guybrush Apr 10 '18 at 08:38
  • Sure - To clarify, I guess I was hoping to find a solution of the form `df["new_var"] = fun( df["var1"] , df["var2"] , ... )` where fun is any arbitrary function that takes in 1 value from each column at a time (thus my use of apply). For example, suppose the function in my example was instead:
    ```
    def fun(n1, n2, c3):
        if n1 > n2 and c3 in ["a", "b"]:
            return n2 + n1 * 2
        if n2**3 == n1/2 and c3 not in ["c"]:
            return n1
        else:
            return n2 / 2
    ```
    for arbitrary numerics n1 / n2 and an arbitrary character c3 (happy to set up test data for this if it helps) – gowerc Apr 10 '18 at 08:47
  • That can be done with `.apply` but using an arbitrary function means that you won't benefit from pandas/numpy performance however :-/ You should try as much as possible to work directly on arrays. `numpy.where(cond, x, y)` can help you for that, taking values from `x` if `cond` holds, from `y` otherwise. – Guybrush Apr 10 '18 at 08:53
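A minimal sketch of the `numpy.where` suggestion above, applied to Example 2 from the question. It assumes the same `df` as the question; passing a callable to `assign` is one way to chain both derivations without an intermediate variable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

result = (
    df
    # Step 1: bump num by 10 where it is at most 25.
    .assign(num=lambda d: np.where(d["num"] <= 25, d["num"] + 10, d["num"]))
    # Step 2: derive num2 from the updated num and cls.
    .assign(num2=lambda d: d["num"] + np.where(d["cls"] == "a", 10, -10))
)
```

Each lambda receives the dataframe as it stands at that point in the chain, so the second step sees the updated `num`.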

One efficient method is to use pd.Series.map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})

This uses a dictionary to map values of cls to either 10 or -10.

There are many other methods (see @Guybrush's answer), but the dictionary-based method is extendable and efficient for larger dataframes. In my opinion, it is also readable.
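As a rough sketch of how the dictionary approach might be kept as a function of the core inputs, the mapping can live in a module-level dictionary and the derivation in a function that takes two Series. The names `OFFSETS` and `derive_num2` are hypothetical.

```python
import pandas as pd

# Hypothetical lookup table: amount to add for each class.
OFFSETS = {"a": 10, "b": -10}


def derive_num2(num: pd.Series, cls: pd.Series) -> pd.Series:
    """Add the class-specific offset to num."""
    return num + cls.map(OFFSETS)


df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

result = df.assign(num2=lambda d: derive_num2(d["num"], d["cls"]))

# A class that is missing from the dictionary produces a missing value.
unknown = derive_num2(pd.Series([1]), pd.Series(["z"]))
```

One thing to keep in mind with this design is that any value of `cls` not present in the dictionary yields `NaN` rather than raising an error.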

Relevant: Replace values in a pandas series via dictionary efficiently

jpp
  • Hi, Thank you for taking the time to reply. Sorry to be a pain but I was hoping to keep the solution as a function of the core inputs so that it is possible to abstract and unittest the derivation. Although what you've shared works for this example (thank you for helping) I am not sure how it would scale to more complex derivations or those that require multiple inputs. – gowerc Apr 10 '18 at 08:32
  • @CroGo, Thanks for the feedback. Not sure I follow you. (1) This is a function of the core inputs; (2) dictionaries are native to Python, in a unittest you can easily compare a dictionary to a test object; (3) this method is *extremely* scalable. Your dictionary can have a million items; this will still work efficiently. – jpp Apr 10 '18 at 08:35
  • Sure - To clarify, I guess I was hoping to find a solution of the form `df["new_var"] = fun( df["var1"] , df["var2"] , ... )` where fun is any arbitrary function that takes in 1 value from each column at a time (thus my use of apply). For example, suppose the function in my example was instead:
    ```
    def fun(n1, n2, c3):
        if n1 > n2 and c3 in ["a", "b"]:
            return n2 + n1 * 2
        if n2**3 == n1/2 and c3 not in ["c"]:
            return n1
        else:
            return n2 / 2
    ```
    for arbitrary numerics n1 / n2 and an arbitrary character c3 (happy to set up test data for this if it helps) – gowerc Apr 10 '18 at 08:45
  • @CroGo, this is still very easily vectorisable. Note your logic with custom `pd.DataFrame.apply` will work *line by line* in a Python-level loop, instead of a vectorised fashion. The custom function explicitly avoids use of `pandas` vectorised functionality. I see your point on separating an "assignment function" - if this is what you are doing, you could do equally well *without* `pandas`. (Just hold your data in a list of lists!) – jpp Apr 10 '18 at 08:50
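To make the point about vectorising concrete, here is one possible sketch of the three-input function from the comment above, written with nested `numpy.where` calls so that it accepts whole columns. The names `fun_scalar` and `fun_vectorised` and the test data are made up for illustration, and this is only one of several ways the logic could be expressed.

```python
import numpy as np
import pandas as pd


def fun_scalar(n1, n2, c3):
    """Row-at-a-time version, following the logic given in the comment."""
    if n1 > n2 and c3 in ["a", "b"]:
        return n2 + n1 * 2
    if n2 ** 3 == n1 / 2 and c3 not in ["c"]:
        return n1
    return n2 / 2


def fun_vectorised(n1, n2, c3):
    """Same logic applied to whole pandas Series at once."""
    first = (n1 > n2) & c3.isin(["a", "b"])
    second = (n2 ** 3 == n1 / 2) & ~c3.isin(["c"])
    # The nested where mirrors the if / if / else order of the scalar version.
    values = np.where(first, n2 + n1 * 2, np.where(second, n1, n2 / 2))
    return pd.Series(values, index=n1.index)


df = pd.DataFrame({
    "n1": [5, 16, 1, 8],
    "n2": [2, 2, 3, 10],
    "c3": ["a", "d", "c", "b"],
})

result = df.assign(new_var=lambda d: fun_vectorised(d["n1"], d["n2"], d["c3"]))

# The scalar function can serve as a reference implementation in a unit test.
expected = [fun_scalar(*row) for row in zip(df["n1"], df["n2"], df["c3"])]
```

The vectorised function still takes plain columns as inputs, so it fits the `df["new_var"] = fun(df["var1"], df["var2"], ...)` form while avoiding a Python-level loop over rows.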