First: Maybe this fits better in code review, but I think here are more pandas affine users. Feel free to move it if you think otherwise
It is often the case, that one wants to compute a new column out of existing ones: Consider the following pandas Dataframe
df = pd.DataFrame({'ItemNumber':np.arange(1, 10),
'A':[1,1,1,2,2,2,3,3,3], 'B': [1,2,3]*3})
print(df)
A B ItemNumber
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 7
7 3 2 8
8 3 3 9
Let's say we compute a new column 'C' via
df.loc[(df["A"] == 1) & (df["B"] > 1), 'C'] = 1
df.loc[(df["B"] == 1) & (df["A"] > 1), 'C'] = 2
df.loc[(df["A"] > 1) & (df["B"] > 1), 'C' ] = 3
df.loc[(df["A"] == 1) & (df["B"] == 1), 'C' ] = 4
This will perform quite fast also on large dataframes when compared to the iterative approach presented here. Particularly, the perfomance issues with that approach lead me to the code presented above.
However, this code violates the DRY principle. Copy pasting feels akward.
So let's go a bit more functional and define two curried functions:
def col_equals_value(col, value):
def filter_df(df):
return df[col] == value
return filter_df
def col_greater_value(col, value):
def filter_df(df):
return df[col] > value
return filter_df
from there on, we define our comparisons:
a1 = col_equals_value('A', 1)
b1 = col_equals_value('B', 1)
agt1 = col_greater_value('A', 1)
bgt1 = col_greater_value('B', 1)
Another function is needed to assign the values to a column:
def assign_value(cond_1, cond_2, value):
def assign_col_value(df, col):
df.loc[df.apply(cond_1, axis=1) & df.apply(cond_2, axis=1), col] =value
return assign_col_value
And finally, we can define the condtion-to-values mapping as
mapping = [(a1, b1, 4),
(a1, bgt1, 1),
(agt1, b1, 2),
(agt1, bgt1, 3)]
construct the assign_value_functions
m = [assign_value(x, y, z) for (x,y,z) in mapping]
and apply each function to the dataframe:
for f in m:
f(df, 'C')
print(df)
A B ItemNumber
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 7
7 3 2 8
8 3 3 9
So what are the questions? This approach seems not very scalable. For each comparison operator I seem to need to define a completely new function. Can the comparison operator be a variable? Currently, I'm only supporting 2 conditions concatenated with the & operator. How to generalize that? I'm unsure about calling the apply method. I feel that there should be a simpler way.
Any help is welcome