Pandas: Calculate Median of Group count over Columns

Question

This is an extension of this question:

I have a Pandas dataframe such as:

dfq = pd.DataFrame({'COL1': ['USER1', 'USER1','USER2','USER2','USER2','USER3'], 
               'COL2' : ['MONTH1','MONTH2','MONTH1','MONTH1','MONTH2','MONTH1']
               })

In general, this means everytime a customer uses the service, a record is added to the table with the user ID and the month. I need to know in average how many times customer use the service per month.

I can count the month occurrences like:

dfq.groupby('COL2').count()

But, how do I get the averages from there? Or is there a better way to do this?

My desired output would be something like this:

If I count the number of groups (months) and then divide by the total number of records I can get a raw average:

testcount = dfq.groupby('COL2').count()
len(dfq)
testcount/len(dfq)*100

Which sort of gives me the answer but I find it to be a very raw process. Averages are not trustworthy, I'd like to be able to get some more stadistical information: medians and deviations for instance.

In other words, I would like what they did here but in their case they are calculating over numerical values while my values are strings. I need to get insights like: what is the median customer usage of the service per month.

I hope that is clear.

Thank you!

I am a little confused by your question, is 3 a desired output in this example? If not could you edit and provide your desired output? — Sebastian, May 26 '20 at 21:10
@Sebastian, I just added some more information. I hope that makes it more clear. — Wilmar, May 26 '20 at 21:34
df.groupby(['COL2'], as_index=False).count().describe() Is this anywhere near to your desired solution? I am stiil not sure what you need because you write about mean and then you calculate distribution. — Sebastian, May 26 '20 at 21:47
Hi Sebastian. No, I am not interested on each individual person. You are right I could plot. I am trying to figure out a plot that actually makes sense. Any suggestion is welcome. Thanks! — Wilmar, May 27 '20 at 19:53

score 2 · Accepted Answer · answered May 26 '20 at 22:04

2

Consider starting with:

df=dfq.assign(cnt=1).groupby(list(dfq.columns), as_index=False)["cnt"].sum()

Then you open the window of opportunity for the operations you mentioned:

res_mean=df.groupby("COL2")["cnt"].apply(np.median)

res_std=df.groupby("COL2")["cnt"].apply(np.std)

Which returns accordingly:

COL2
MONTH1    1.0
MONTH2    1.0
Name: cnt, dtype: float64
COL2
MONTH1    0.471405
MONTH2    0.000000
Name: cnt, dtype: float64

answered May 26 '20 at 22:04

Grzegorz Skibinski

12,624
2
11
34

1

Thank you very much @Grzegorz Skibinski. I am going to give this a try and report back. – Wilmar May 27 '20 at 19:51

Pandas: Calculate Median of Group count over Columns

1 Answers1