I have the pandas dataframe "data", and want to keep only the rows where the sum of "numb_people" per category "class" is at least 2.

However, this raises an indexing error, because the grouped result is indexed by `class` and no longer matches the index of `data`:

data = data[data.groupby('class').sum()['numb_people'] > 2]

How can I do this in a similarly simple manner?

TestGuest
  • Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard/52413247#52413247) – Trenton McKinney Oct 09 '19 at 01:46
  • `data[data.groupby('class').numb_people.transform('sum') > 2]` – rafaelc Oct 09 '19 at 01:47
  • If I do `data = data[data.groupby('class').numb_people.transform('sum') > 2]`, does this filter the data so that only classes with sum > 2 are left, or does the new `data` variable actually contain the sums (which it should not)? – TestGuest Oct 09 '19 at 01:57
  • `groupby` objects in pandas have a [`filter`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#filtration) method that might make this code a little more elegant than using the `transform` method. It's pandas' closest equivalent to a SQL HAVING clause. – Jacob Turpin Oct 09 '19 at 02:11
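
A minimal sketch of the `filter` approach mentioned in the last comment. The sample DataFrame is hypothetical (the question does not show its data); only the column names `class` and `numb_people` come from the question.

```python
import pandas as pd

# Hypothetical sample data; the original question does not include its DataFrame.
data = pd.DataFrame({
    "class": ["a", "a", "b", "c", "c"],
    "numb_people": [1, 2, 1, 3, 0],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True (similar in spirit to SQL's HAVING).
filtered = data.groupby("class").filter(lambda g: g["numb_people"].sum() > 2)
print(filtered)
```

Here the rows of class `b` (total 1) are dropped, while the rows of `a` and `c` (total 3 each) are kept with their original values. On large frames with many groups, a Python-level lambda per group may be slower than the `transform` mask, so it is worth timing both.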

1 Answer

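As @rafaelc said in the comments, build a boolean mask with `transform`. It returns one value per original row, so the mask aligns with the index of `data`, and the result contains the original rows rather than the sums: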

idx = data.groupby('class').numb_people.transform('sum') > 2
print(data[idx])
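
To see what the mask does, here is a small self-contained sketch. The sample DataFrame is hypothetical (the question does not show its data); only the column names `class` and `numb_people` are taken from the question.

```python
import pandas as pd

# Hypothetical sample data; the original question does not include its DataFrame.
data = pd.DataFrame({
    "class": ["a", "a", "b", "c", "c"],
    "numb_people": [1, 2, 1, 3, 0],
})

# transform('sum') broadcasts each group's total back onto its rows,
# so the result has the same length and index as `data`.
group_totals = data.groupby("class")["numb_people"].transform("sum")

# Boolean mask: True for every row whose class total exceeds 2.
mask = group_totals > 2

# Indexing with the mask keeps the original rows and values, not the sums.
filtered = data[mask]
print(filtered)
```

With this sample, classes `a` and `c` each total 3 and are kept, while class `b` totals 1 and is dropped. If the threshold is meant to be "at least 2", as the question's wording suggests, `>= 2` would be the comparison to use.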
oreopot