0

This is an extension of this question:

I have a Pandas dataframe such as:

dfq = pd.DataFrame({'COL1': ['USER1', 'USER1','USER2','USER2','USER2','USER3'], 
               'COL2' : ['MONTH1','MONTH2','MONTH1','MONTH1','MONTH2','MONTH1']
               })

Original Data Frame

In general, this means everytime a customer uses the service, a record is added to the table with the user ID and the month. I need to know in average how many times customer use the service per month.

I can count the month occurrences like:

dfq.groupby('COL2').count()

Count

But, how do I get the averages from there? Or is there a better way to do this?

My desired output would be something like this:

If I count the number of groups (months) and then divide by the total number of records I can get a raw average:

testcount = dfq.groupby('COL2').count()
len(dfq)
testcount/len(dfq)*100

Average per month

Which sort of gives me the answer but I find it to be a very raw process. Averages are not trustworthy, I'd like to be able to get some more stadistical information: medians and deviations for instance.

In other words, I would like what they did here but in their case they are calculating over numerical values while my values are strings. I need to get insights like: what is the median customer usage of the service per month.

I hope that is clear.

Thank you!

Wilmar
  • 558
  • 1
  • 5
  • 16
  • are you calculating the average per *user* or per *month*? – MattR May 26 '20 at 21:00
  • Hi @MattR, I am calculating the average per month – Wilmar May 26 '20 at 21:01
  • I am a little confused by your question, is 3 a desired output in this example? If not could you edit and provide your desired output? – Sebastian May 26 '20 at 21:10
  • @Sebastian, I just added some more information. I hope that makes it more clear. – Wilmar May 26 '20 at 21:34
  • df.groupby(['COL2'], as_index=False).count().describe() Is this anywhere near to your desired solution? I am stiil not sure what you need because you write about mean and then you calculate distribution. – Sebastian May 26 '20 at 21:47
  • You are not interested in customer as an individual right? – Sebastian May 26 '20 at 21:49
  • You could also plot the data after grouping/counting – Sebastian May 26 '20 at 21:50
  • Hi Sebastian. No, I am not interested on each individual person. You are right I could plot. I am trying to figure out a plot that actually makes sense. Any suggestion is welcome. Thanks! – Wilmar May 27 '20 at 19:53

1 Answers1

2

Consider starting with:

df=dfq.assign(cnt=1).groupby(list(dfq.columns), as_index=False)["cnt"].sum()

Then you open the window of opportunity for the operations you mentioned:

res_mean=df.groupby("COL2")["cnt"].apply(np.median)

res_std=df.groupby("COL2")["cnt"].apply(np.std)

Which returns accordingly:

COL2
MONTH1    1.0
MONTH2    1.0
Name: cnt, dtype: float64
COL2
MONTH1    0.471405
MONTH2    0.000000
Name: cnt, dtype: float64
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34