
I have a dataset that I'm grouping, and then trying to remove any groups that have no data in a particular column. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                   'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                   'name': ['John', np.nan, 'Terry', 'Graham', 'Eric', np.nan]})
g = df.groupby('movie')

  movie    name  rating
0   thg    John       3
1   thg     NaN       4
2   mol   Terry       5
3   mol  Graham     NaN
4   lob    Eric     NaN
5   lob     NaN     NaN

I would like to remove the group lob from the dataset, as nobody has rated it. I've tried

mask = g['rating'].mean().isnull()
g.filter(~mask)

which gives me an error of TypeError: 'Series' object is not callable. That's kind of hackish, so I've also tried

g.filter(lambda group: group.isnull().all())

which seems more Pythonic, but it gives me an error of ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). How can I filter out a group, and why do I get these errors? Any additional information about groupby in general would also be helpful. I'm using pandas 0.12.0, Python 2.7.5, and Mac OS X 10.8.5.

  • EDIT: Never mind not a duplicate but this might help. http://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-of-certain-column-is-nan – Shashank Sep 22 '13 at 22:13

1 Answer


If you want to filter the group you can do this:

g = df.groupby('movie').count()
g = g[g['rating']>0]

Out[14]:
       movie  name  rating
movie
mol        2     2       1
thg        2     1       2
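
If the goal is to get back the original rows rather than the counts, one option (a sketch, re-creating the question's DataFrame) is to select the movies that survive the count filter via their index with `isin`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                   'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                   'name': ['John', np.nan, 'Terry', 'Graham', 'Eric', np.nan]})

counts = df.groupby('movie').count()          # count() ignores NaN per column
rated = counts[counts['rating'] > 0].index    # movies with at least one rating
subset = df[df['movie'].isin(rated)]          # original rows, whole groups kept
```

Here `subset` keeps every row for `thg` and `mol` (including their NaN ratings) and drops all of `lob`.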

Or you can filter the DataFrame first and then group:

g = df[df['rating'].notnull()].groupby('movie').count()

This will affect the final counts:

Out[15]:
      movie name rating
movie           
mol     1   1   1
thg     2   1   2

So mol has a lower movie and name count compared to the above, but the rating count is the same.

EdChum
  • Unfortunately, that just filters out ALL records with a NaN rating. I want to filter out all groups that have ONLY NaN ratings. If a group has even one non-NaN rating, I want to keep the whole group. (And eventually impute the mean to those NaN values.) – Tim Lewandowski Sep 23 '13 at 01:50
  • The top-level code, by removing the NaN, means that the group does not appear in the groupby; how is this different to what you want? Or do you want to keep the total count of the group so that you can impute the mean? – EdChum Sep 23 '13 at 07:22
  • @TimLewandowski I'm not understanding after re-reading your comment why the second code snippet does not do what you want as you will filter out all the rows that have NaN ratings. So as long as you have a single rating for a movie then it will feature in the groupby operation. – EdChum Sep 23 '13 at 07:54
  • After removing groups without any ratings, I want to be able to fill in the NaN ratings with the mean rating for each group. However, I need to have at least one rating to be able to do that. So I want something that will drop the `lob` group, but keep every record of both the `mol` and `thg` group. – Tim Lewandowski Sep 23 '13 at 13:02
  • @TimLewandowski the first code snippet would retain all records even those with NaN for groups that have at least one rating, you could then impute the mean for the rating from this I would have thought? – EdChum Sep 23 '13 at 13:12
  • I actually can't get your code to give me back a DataFrame with the original data, just the count data. I tried `df[g['rating']] > 0`, but that gave me `IndexingError: Unalignable boolean Series key provided`, which has been similar to what's happening when I've tried other ways to filter by group - the indices are not aligned. – Tim Lewandowski Sep 23 '13 at 15:32
  • @TimLewandowski yes that would not work, I think what you want is just g = df[df['rating'].notnull()].groupby('movie').mean(). This selects just the ratings that are not NaN, groups by movie and calcs mean. Does this now do what you want? – EdChum Sep 23 '13 at 15:44
  • That's the workaround I came up with. I'd still like to figure out how to get the original dataframe back with some groups dropped. – Tim Lewandowski Sep 23 '13 at 21:29
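
For completeness, the follow-up from the comments (drop whole unrated groups but keep every row, then impute each group's mean) can be sketched with `GroupBy.filter` and `transform`, re-using the question's DataFrame. `filter` expects a function that returns a single boolean per group, which is also why passing a Series mask or an array-valued test raised the errors in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
                   'rating': [3., 4., 5., np.nan, np.nan, np.nan],
                   'name': ['John', np.nan, 'Terry', 'Graham', 'Eric', np.nan]})

# Keep every row of any movie with at least one non-NaN rating;
# the lambda returns one scalar boolean per group, as filter requires.
kept = df.groupby('movie').filter(lambda grp: grp['rating'].notnull().any())

# Fill each remaining movie's missing ratings with that movie's mean rating.
kept['rating'] = kept.groupby('movie')['rating'].transform(
    lambda s: s.fillna(s.mean()))
```

This drops all of `lob`, keeps both rows of `mol` (Graham's missing rating becomes 5.0, the `mol` mean), and leaves `thg` unchanged.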