I am looking to return a dataframe for the last 5 times a Product was sold & I'm running into issues.
Here is my dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='D'), 10000)})
I thought I could sort the Dataframe by date then use .cumcount() to create a helper column to later filter by. Here's what I tried:
df = df.sort_values('Date',ascending=False)
df['count_product'] = df.groupby(['Date','Product']).cumcount() + 1
df2 = df.loc[df.count_product < 5]
This does not work as intended. Based on the data above, I would have expected Product 1 to the following dates included in the new dataframe: 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-27. Product 3 would have the dates 2018-12-31, 2018-12-30, 2018-12-29, 2018-12-28, & 2018-12-26.
Any suggestions?