Python version: 3.7.3

Something similar was asked here, but it's not quite the same.

Based on a condition, I would like to retrieve only a subset of each group of a DataFrameGroupBy object. Basically, if a group's DataFrame starts with rows that contain only NaNs, I want to delete those rows. Otherwise, I want to keep the entire DataFrame intact. To accomplish this, I wrote a function delete_rows.

Grouped_object = df.groupby(['col1', 'col2']) 

def delete_rows(group):
  pos_min_notna = group[group['cumsum'].notna()].index[0]
  return group[pos_min_notna:]

new_df = Grouped_object.apply(delete_rows)

However, this function seems to only do the "job" for the first group in the DataFrameGroupBy object. What am I missing, so it does this for all the groups and "glues" the subsets together?

Edit: the function delete_rows above was simplified following the logic suggested by Laurens Koppenol.

Anonymous
  • Why not always return `group[pos_min_notna:]`? That starts at the first row that is not missing, which may be the first row in the group (iloc 0). – Laurens Koppenol Jul 23 '19 at 11:30
  • You're absolutely correct, I should indeed do that. So the function can be reduced to your logic. However, having done that, it still only returns data from the first group within the DataFrameGroupBy. Any suggestion for that? I am obviously missing something here, but can't find it – Anonymous Jul 23 '19 at 11:35
  • Not sure why it works for the first group only, but if you are asking for an alternative solution, you should provide a dataset. Otherwise it is difficult for us to test any solution. – Valentino Jul 23 '19 at 11:44
  • @Valentino the answer provided works. The problem was not using `.loc`. – Anonymous Jul 23 '19 at 12:08

1 Answer

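In pandas you have to distinguish carefully between index labels (`.loc`) and integer positions (`.iloc`). It is always a good idea to make the choice explicit. A bare integer slice such as `group[pos_min_notna:]` is interpreted positionally, but `pos_min_notna` is a label taken from the index. The two coincide only when a group's labels happen to match its positions, which is typically just the first group of a frame with a default index.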

This answer has a great overview of the differences

Grouped_object = df.groupby(['col1', 'col2']) 

def delete_rows(group):
  # .index[0] returns an index label (not a position), so it belongs with .loc
  pos_min_notna = group[group['cumsum'].notna()].index[0]
  return group.loc[pos_min_notna:]  # label-based slice, made explicit

# the result usually gets the group keys prepended to its index
new_df = Grouped_object.apply(delete_rows)
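
To make the difference concrete, here is a minimal sketch on made-up data. The frame, its values, and the helper names `delete_rows_positional` and `delete_rows_label` are my own assumptions for illustration and are not from the question. I also pass `group_keys=False`, which should keep the original row labels instead of prepending the group keys.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, each starting with NaN rows in 'cumsum'.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": ["x", "x", "x", "y", "y", "y"],
    "cumsum": [np.nan, 1.0, 2.0, np.nan, np.nan, 5.0],
})

def delete_rows_positional(group):
    # Label of the first non-NaN row, then used as a *position* by the bare slice.
    first_label = group[group["cumsum"].notna()].index[0]
    return group[first_label:]

def delete_rows_label(group):
    # Same label, but sliced by label with .loc.
    first_label = group[group["cumsum"].notna()].index[0]
    return group.loc[first_label:]

grouped = df.groupby(["col1", "col2"], group_keys=False)

broken = grouped.apply(delete_rows_positional)
fixed = grouped.apply(delete_rows_label)

print(list(broken.index))  # rows from the first group only
print(list(fixed.index))   # rows from both groups
```

In the second group the first non-NaN label is 5, but the group only has three rows, so the positional slice starts past the end and comes back empty. Note that both helpers raise an IndexError for a group whose 'cumsum' is entirely NaN, so that case would need its own handling.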

Minimal example showing the unwanted behavior:

import pandas as pd

df = pd.DataFrame([[1,2,3], [2,4,6], [2,4,6]], columns=['a', 'b', 'c'])

# Positional: drop the first row of every group
df.groupby('a').apply(lambda g: g.iloc[1:])

# A bare integer slice is also positional, so this gives identical results:
df.groupby('a').apply(lambda g: g[1:])

# Label-based: keep the rows of each group whose index label is 1 or higher.
# With a fixed index on a sorted df this is not a meaningful per-group
# operation; it is here only to show that the result differs.
df.groupby('a').apply(lambda g: g.loc[1:])
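
The same distinction can be seen without groupby at all. The sketch below uses a hypothetical single frame `g` whose labels (10, 11, 12) do not match its positions (0, 1, 2), which is the situation every group after the first is usually in. The variable names are mine, chosen for illustration.

```python
import pandas as pd

# Stand-in for one group: labels 10..12, positions 0..2.
g = pd.DataFrame({"c": [3, 6, 9]}, index=[10, 11, 12])

by_position = g.iloc[1:]  # skip the first row by position
plain_slice = g[1:]       # bare integer slice: positional as well
by_label = g.loc[11:]     # start at the row labelled 11 (inclusive)
past_the_end = g[11:]     # position 11 does not exist, so nothing is left

print(by_position.equals(plain_slice))
print(list(by_label.index))
print(len(past_the_end))
```

So a label such as 11 only does what you expect when it goes through `.loc`; handed to a bare slice it is read as "start at the 12th row".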


Laurens Koppenol
  • Great, just for my understanding. So because I did not explicitly specify `loc` or `iloc`, by using `group[pos_min_notna:]` it uses iloc? Your solution works perfectly, thanks for the help! – Anonymous Jul 23 '19 at 12:06