I need to calculate the number of activity_months for each product in a pandas DataFrame. Here is my data and code so far:

from pandas import DataFrame
from datetime import datetime
data = [
('product_a','08/31/2013')
,('product_b','08/31/2013')
,('product_c','08/31/2013')
,('product_a','09/30/2013')
,('product_b','09/30/2013')
,('product_c','09/30/2013')
,('product_a','10/31/2013')
,('product_b','10/31/2013')
,('product_c','10/31/2013')
]

product_df = DataFrame( data, columns=['prod_desc','activity_month'])

for index, row in product_df.iterrows():
  row['activity_month']= datetime.strptime(row['activity_month'],'%m/%d/%Y')
  product_df.loc[index, 'activity_month'] = datetime.strftime(row['activity_month'],'%Y-%m-%d')

product_df = product_df.sort(['prod_desc','activity_month'])

product_df['month_num'] = product_df.groupby(['prod_desc']).size()

However, this returns NaNs for month_num.

Here is what I want to get:

prod_desc    activity_month   month_num
product_a       2013-08-31         1
product_a       2013-09-30         2
product_a       2013-10-31         3
product_b       2013-08-31         1
product_b       2013-09-30         2
product_b       2013-10-31         3
product_c       2013-08-31         1
product_c       2013-09-30         2
product_c       2013-10-31         3
analyticsPierce
  • You are modifying values while iterating, which is a no-no in Python. It can work, because iterrows returns a view in the single-dtype case, but in general it is a bad idea; always return a new frame (or copy and modify the copy). – Jeff May 21 '14 at 19:26
  • Use pd.to_datetime() to convert your dates all in one shot. – Jeff May 21 '14 at 19:29
  • It's not yet clear to me what you want to achieve: should `month_num` simply be equal to the month in `activity_month`? What's your ultimate goal? – ojdo May 21 '14 at 19:30
  • @ojdo good point. I'll edit the example to be more clear. I am interested in counting the activity_months. This has nothing to do with what month it is. If there are 5 activity_months for a product I need the row counts to go from 1 to 5 within that group. I will be adding logic for separate calculations for the first month, the second month, etc... – analyticsPierce May 21 '14 at 19:34
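
The two suggestions in the comments above can be sketched as follows: convert the whole date column in one call instead of mutating rows inside an iterrows loop. This is a minimal sketch, assuming the question's column names and the '%m/%d/%Y' input format; the shortened data list is illustrative only.

```python
import pandas as pd

# Shortened, illustrative version of the question's data.
data = [
    ('product_a', '08/31/2013'),
    ('product_b', '08/31/2013'),
    ('product_a', '09/30/2013'),
    ('product_b', '09/30/2013'),
]
product_df = pd.DataFrame(data, columns=['prod_desc', 'activity_month'])

# Convert the entire column at once; no row-by-row assignment is needed.
product_df['activity_month'] = pd.to_datetime(
    product_df['activity_month'], format='%m/%d/%Y'
)

# sort_values is the current spelling of the DataFrame.sort call used in
# the question, which was removed from later pandas versions.
product_df = product_df.sort_values(['prod_desc', 'activity_month'])
print(product_df)
```

The column then holds real datetime values rather than reformatted strings, so it sorts chronologically without any string formatting step.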

1 Answer

The groupby is the right idea, but the right method is cumcount:

>>> product_df['month_num'] = product_df.groupby('product_desc').cumcount()
>>> product_df

  product_desc activity_month  prod_count    pct_ch  month_num
0    product_a     2014-01-01          53       NaN          0
3    product_a     2014-02-01          52 -0.018868          1
6    product_a     2014-03-01          50 -0.038462          2
1    product_b     2014-01-01          44       NaN          0
4    product_b     2014-02-01          43 -0.022727          1
7    product_b     2014-03-01          41 -0.046512          2
2    product_c     2014-01-01          36       NaN          0
5    product_c     2014-02-01          35 -0.027778          1
8    product_c     2014-03-01          34 -0.028571          2

If you really want it to start with 1, then just do this instead:

>>> product_df['month_num'] = product_df.groupby('product_desc').cumcount() + 1
>>> product_df

  product_desc activity_month  prod_count    pct_ch  month_num
0    product_a     2014-01-01          53       NaN          1
3    product_a     2014-02-01          52 -0.018868          2
6    product_a     2014-03-01          50 -0.038462          3
1    product_b     2014-01-01          44       NaN          1
4    product_b     2014-02-01          43 -0.022727          2
7    product_b     2014-03-01          41 -0.046512          3
2    product_c     2014-01-01          36       NaN          1
5    product_c     2014-02-01          35 -0.027778          2
8    product_c     2014-03-01          34 -0.028571          3
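
For reference, here is a sketch of the same approach on a frame shaped like the one in the question, where the column is named prod_desc rather than product_desc. It also illustrates the likely reason the original attempt gave NaN: groupby(...).size() returns one value per group, indexed by the group label, so assigning it as a column aligns on labels that do not match the frame's integer index. cumcount() returns a Series that carries the original row index. The shortened data is illustrative only.

```python
import pandas as pd

# Illustrative frame using the question's column names.
product_df = pd.DataFrame({
    'prod_desc': ['product_a', 'product_b', 'product_a', 'product_b', 'product_a'],
    'activity_month': pd.to_datetime(
        ['2013-08-31', '2013-08-31', '2013-09-30', '2013-09-30', '2013-10-31']
    ),
})
product_df = product_df.sort_values(['prod_desc', 'activity_month'])

# size() yields one row per group, indexed by prod_desc; these labels do
# not match the integer row index of product_df.
sizes = product_df.groupby('prod_desc').size()
print(sizes)

# cumcount() numbers the rows within each group starting at 0 and keeps
# the original row index, so the assignment lines up row by row.
product_df['month_num'] = product_df.groupby('prod_desc').cumcount() + 1
print(product_df)
```

Because cumcount() numbers rows in their current order, sorting by activity_month within each product beforehand is what makes month_num follow the calendar.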
Karl D.