Create new column depending on other columns: DRY and go functional

Question

First: Maybe this fits better in code review, but I think here are more pandas affine users. Feel free to move it if you think otherwise

It is often the case, that one wants to compute a new column out of existing ones: Consider the following pandas Dataframe

df = pd.DataFrame({'ItemNumber':np.arange(1, 10), 
     'A':[1,1,1,2,2,2,3,3,3], 'B': [1,2,3]*3})
print(df)
   A  B  ItemNumber
0  1  1           1
1  1  2           2
2  1  3           3
3  2  1           4
4  2  2           5
5  2  3           6
6  3  1           7
7  3  2           8
8  3  3           9

Let's say we compute a new column 'C' via

df.loc[(df["A"] == 1) & (df["B"] > 1), 'C'] = 1
df.loc[(df["B"] == 1) & (df["A"] > 1), 'C'] = 2 
df.loc[(df["A"] > 1) &  (df["B"] > 1), 'C' ] = 3
df.loc[(df["A"] == 1) & (df["B"] == 1), 'C' ] = 4

This will perform quite fast also on large dataframes when compared to the iterative approach presented here. Particularly, the perfomance issues with that approach lead me to the code presented above.

However, this code violates the DRY principle. Copy pasting feels akward.

So let's go a bit more functional and define two curried functions:

def col_equals_value(col, value):
    def filter_df(df):
        return df[col] == value
    return filter_df

def col_greater_value(col, value):
    def filter_df(df):
        return df[col] > value
    return filter_df

from there on, we define our comparisons:

a1 = col_equals_value('A', 1)
b1 = col_equals_value('B', 1)
agt1 = col_greater_value('A', 1)
bgt1 = col_greater_value('B', 1)

Another function is needed to assign the values to a column:

def assign_value(cond_1, cond_2, value):
    def assign_col_value(df, col):
        df.loc[df.apply(cond_1, axis=1) & df.apply(cond_2, axis=1), col] =value
    return assign_col_value

And finally, we can define the condtion-to-values mapping as

mapping = [(a1, b1, 4), 
           (a1, bgt1, 1), 
           (agt1, b1, 2), 
           (agt1, bgt1, 3)]

construct the assign_value_functions

m = [assign_value(x, y, z) for (x,y,z) in mapping]

and apply each function to the dataframe:

for f in m:
    f(df, 'C')
print(df)

   A  B  ItemNumber
0  1  1           1
1  1  2           2
2  1  3           3
3  2  1           4
4  2  2           5
5  2  3           6
6  3  1           7
7  3  2           8
8  3  3           9

So what are the questions? This approach seems not very scalable. For each comparison operator I seem to need to define a completely new function. Can the comparison operator be a variable? Currently, I'm only supporting 2 conditions concatenated with the & operator. How to generalize that? I'm unsure about calling the apply method. I feel that there should be a simpler way.

Any help is welcome

pansen · Answer 1 · 2017-03-02T10:54:08.110

You could make use of pandas.DataFrame.eval here. First, define a dictionary trans which contains your transformations you want to apply. Second, employ a helper function apply which takes advantage of the eval:

trans = {"C": {"A == 1 and B > 1": 1,
               "B == 1 and A > 1": 2,
               "A > 1 and B > 1": 3,
               "A == 1 and B == 1": 4}}

def apply(sub_df, trans_dict):
    # sub_df = sub_df.copy() # in case you don't want change original df
    for column, transforms in trans_dict.items():
        for transform, value in transforms.items():
            sub_df.loc[sub_df.eval(transform), column] = value

    return sub_df

apply(df, trans)

    A   B   ItemNumber  C
0   1   1   1           4.0
1   1   2   2           1.0
2   1   3   3           1.0
3   2   1   4           2.0
4   2   2   5           3.0
5   2   3   6           3.0
6   3   1   7           2.0
7   3   2   8           3.0
8   3   3   9           3.0

I think it is justified to use pandas' eval here for greater readability. You can now supply whatever column-condition-value combinations to your trans dict.

However, we are still kind of violating DRY because every atomic condition like A == 1 is evaluated multiple times instead of only once as in your provided example. But I guess there will be workarounds to efficiently memorize those boolean series.

hey, I really like your idea, it's straight forward, the mapping is defined relatively precisely. However, I'm not yet sure if I'm looking for a more functional solution. I think, my solution can be improved using the operator module and the reduce function to allow arbitrary many conditions. But the mapping/transformation might be less readable. — Quickbeam2k1, Mar 02 '17 at 15:09

Create new column depending on other columns: DRY and go functional

1 Answers1