Groupby with Cumcount - Not Working As Expected

Question

I am looking to return a dataframe for the last 5 times a Product was sold & I'm running into issues.

Here is my dataframe:

np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',  
                      freq='D'), 10000)})

I thought I could sort the Dataframe by date then use .cumcount() to create a helper column to later filter by. Here's what I tried:

df = df.sort_values('Date',ascending=False)
df['count_product'] = df.groupby(['Date','Product']).cumcount() + 1
df2 = df.loc[df.count_product < 5]

This does not work as intended. Based on the data above, I would have expected Product 1 to the following dates included in the new dataframe: 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-27. Product 3 would have the dates 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-26.

Any suggestions?

@W-B I'm looking for a dataframe that contains all columns that df does, but filtered down to the last 4 dates for each product. Please note that the same product could be sold multiple times in a day. — keg5038, Jan 24 '19 at 20:20
`cumcount < 4` would only give you 4 dates (and since you added one that makes it three), but you seem to want 5 for each product given your output (31, 30, 29, 28, 27). So is it 5? — ALollz, Jan 24 '19 at 20:36
sorry that was a typo - I'll fix above. Even fixing that, cumcount() does not work as I thought it would. — keg5038, Jan 24 '19 at 20:37
@keg5038 your groups are `['Product', 'Date']` so cumcount is a counter for the number of records you have for each product on each date. Your current filter is to ensure you have fewer than 4 records for each product on each date, which isn't what you want. — ALollz, Jan 24 '19 at 20:45

score 1 · Answer 1 · answered Jan 24 '19 at 20:27

1

Check with drop_duplicates then groupby with head, after filter we using merge

yourdf=df.drop_duplicates(['Product','Date']).groupby('Product').head(4)[['Product','Date']].merge(df)

answered Jan 24 '19 at 20:27

BENY

317,841
20
164
234

this does work - thank you! I'm still unclear why cumcount() does not work like it does in this [link] (https://stackoverflow.com/questions/23435270/how-to-add-sequential-counter-column-on-groups-using-pandas-groupby) – keg5038 Jan 24 '19 at 20:41
@keg5038 you have duplicate , so that cumcount dose not work – BENY Jan 24 '19 at 20:44

score 0 · Accepted Answer · answered Jan 24 '19 at 20:28

You can create a filter from the groupby:

s = df.groupby('Product').apply(lambda x: x.Date.ge(x.Date.drop_duplicates().nlargest(5).iloc[-1])).reset_index(0, True)

df2 = df.loc[s]

Just to check:

df2.groupby('Product').Date.agg(['min', 'max'])
                 min        max
Product                        
Product 1 2018-12-27 2018-12-31
Product 2 2018-12-27 2018-12-31
Product 3 2018-12-26 2018-12-31

Groupby with Cumcount - Not Working As Expected

2 Answers2