1

I have a pandas dataframe

x = pd.DataFrame({
    "t": ["A", "A", "A", "B", "B", "B", "B", "A", "C", "C"],
    "m": ["k", "m", "m", "k", "m", "m", "b", "k", "f", "d"],
    "f1": [1.2, np.nan,  0.8, 1, 1, 1.5, 1, np.nan, np.nan, np.nan],
    "f2": [100, 200, 200, 100, 100, 100, 50, 200, 300, 400]})
   t  m   f1   f2
0  A  k  1.2  100
1  A  m  NaN  200
2  A  m  0.8  200
3  B  k  1.0  100
4  B  m  1.0  100
5  B  m  1.5  100
6  B  b  1.0   50
7  A  k  NaN  200
8  C  f  NaN  300
9  C  d  NaN  400

On f1 I want to pick min value of each subgroup t. If a subgroup has Nan values, we need to pick all the values. On f2 I want to pick max values. In case of Nan values, we need to keep all the rows. So I want an output something similar to

   t  m   f1   f2
0  A  m  0.8  200
1  B  k  1.0  100
2  B  m  1.0  100
3  C  d  NaN  400

I was able to achieve this using

def keep_rows(k, col, op):
    # if all values are Nan 
    if np.isnan(k[col].values).all():
        return k 
    return k[k[col] == getattr(np, f"nan{op}")(k[col])]
tt = x.groupby("t", as_index=False).apply(lambda x: keep_rows(x, "f1", "min")).reset_index(drop=True)
tt = tt.groupby("t", as_index=False).apply(lambda x: keep_rows(x, "f2", "max")).reset_index(drop=True)
tt

But is there a better way? Especially in pandas v2?

wjandrea
  • 28,235
  • 9
  • 60
  • 81
Prakash Vanapalli
  • 677
  • 1
  • 9
  • 16
  • this is not a duplicate. – Prakash Vanapalli May 31 '23 at 03:19
  • Then please explain the logic and how this is not a duplicate – mozway May 31 '23 at 03:49
  • firstly this is on multiple columns and need different operations on both of them and there can be multiple eligible rows. need a subset of it. I found the other question to be verbose and most of the answers are wrong. Also want to see if pandas v2 has any elegant solution for this. – Prakash Vanapalli May 31 '23 at 04:59
  • This is not very clear in your question, nevertheless see the answer below if you need simultaneous conditions. If you want **successive** conditions, then this falls back to the duplicate. – mozway May 31 '23 at 05:23
  • Warm welcome to SO. Please try to use correct upper case letters, e.g. in the beginning of your title, sentences or the word "I". This would be gentle to your readers. Please read [How to ask](https://stackoverflow.com/help/how-to-ask) and [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Then update your question with code to show us what you have tried so far. – buhtz May 31 '23 at 10:42
  • 2
    @Prakash tbf it was a duplicate originally, but now it's totally different yeah – wjandrea May 31 '23 at 16:22

1 Answers1

1

If the operations are simultaneous, you need a double groupby.transform:

updated question

g = x.assign(flag=x['f1'].isna()).groupby('t')

# does the row have the min f1 or its group?
m1 = g['f1'].transform('min').eq(x['f1'])
# does the group have all NaN in f1?
m2 = g['flag'].transform('all')
# does the row have the max f2 or its group?
m3 = g['f2'].transform('max').eq(x['f2'])

out = x.loc[(m1 | m2) & m3]

Output:

   t  m   f1   f2
2  A  m  0.8  200
3  B  k  1.0  100
4  B  m  1.0  100
9  C  d  NaN  400

original question

g = x.groupby('t')

m1 = g['f1'].transform('min').eq(x['f1'])
m2 = g['f2'].transform('max').eq(x['f2'])

out = x.loc[m1 & m2]

If the operations are consecutive, then you fall back to a known problem, just chain two selections.

Output:

   t  m   f1   f2
1  A  m  0.8  200
2  B  k  1.0  100
3  B  m  1.0  100
wjandrea
  • 28,235
  • 9
  • 60
  • 81
mozway
  • 194,879
  • 13
  • 39
  • 75