I have a pandas dataframe
x = pd.DataFrame({
"t": ["A", "A", "A", "B", "B", "B", "B", "A", "C", "C"],
"m": ["k", "m", "m", "k", "m", "m", "b", "k", "f", "d"],
"f1": [1.2, np.nan, 0.8, 1, 1, 1.5, 1, np.nan, np.nan, np.nan],
"f2": [100, 200, 200, 100, 100, 100, 50, 200, 300, 400]})
t m f1 f2
0 A k 1.2 100
1 A m NaN 200
2 A m 0.8 200
3 B k 1.0 100
4 B m 1.0 100
5 B m 1.5 100
6 B b 1.0 50
7 A k NaN 200
8 C f NaN 300
9 C d NaN 400
On f1
I want to pick min value of each subgroup t
. If a subgroup has Nan values, we need to pick all the values. On f2
I want to pick max values. In case of Nan values, we need to keep all the rows. So I want an output something similar to
t m f1 f2
0 A m 0.8 200
1 B k 1.0 100
2 B m 1.0 100
3 C d NaN 400
I was able to achieve this using
def keep_rows(k, col, op):
# if all values are Nan
if np.isnan(k[col].values).all():
return k
return k[k[col] == getattr(np, f"nan{op}")(k[col])]
tt = x.groupby("t", as_index=False).apply(lambda x: keep_rows(x, "f1", "min")).reset_index(drop=True)
tt = tt.groupby("t", as_index=False).apply(lambda x: keep_rows(x, "f2", "max")).reset_index(drop=True)
tt
But is there a better way? Especially in pandas v2?