Comparing compressed distributions per cohort

Question

How can I easily compare the distributions of multiple cohorts?

Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.

It was created as:

SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender

where compress_distributionUDF simply takes a list of tuples and returns the counts per group.

This leaves me with a list of

Row(distribution_value=60.0, count=314251, target_y_n=0)

nested inside a pandas.Series, one list per cohort.

Basically, it is similar to:

pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})

and I wonder how to compare distributions:

within a cohort 0 vs. 1 of target_y_n
over multiple cohorts

in a way which is visually still understandable and not only a mess.

edit

For a single cohort, "Plotting pre aggregated data in python" could be the answer. But how can multiple cohorts be compared other than in a loop, which produces too many plots to compare?

Could you add another dataframe which represents your expected output. — Erfan, Apr 08 '19 at 20:45
No the idea is not to have another dataframe but rather a visual comparison of the distributions. I.e. Similar to https://stackoverflow.com/questions/46045750/python-distplot-with-multiple-distributions but with the compressed distributions and not the raw values — Georg Heiler, Apr 09 '19 at 04:21
would a qqplot do? You basically want to check the distributional difference between the groups: `target==0` and `target==1` and for each group you have the value and the count per value, correct? — CAPSLOCK, Apr 09 '19 at 09:53
Also, how would QQPlots work in case of this compressed data? — Georg Heiler, Apr 09 '19 at 10:54
your `distribution_value` goes from 0 to 100? is it a percentile? and `count` is the count for that percentile or is a cumulative count? — CAPSLOCK, Apr 09 '19 at 11:19
No, it is not it is a real double metric value. Most values are between 50 and 70 though. — Georg Heiler, Apr 09 '19 at 11:21

CAPSLOCK · Accepted Answer · 2019-04-09T12:24:49.687

I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
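
The plotting calls in this answer assume a long-format df with one row per bin and the columns value, count and target_y_n. A minimal sketch of getting there from the nested structure in the question, assuming each cohort row holds a list of dicts (the toy numbers and the column name cohort are made up for illustration; Spark Row objects would first need converting, e.g. with Row.asDict()):

```python
import pandas as pd

# Hypothetical toy data: one row per cohort, 'baz' holds a list of compressed bins.
nested = pd.DataFrame({
    "cohort": ["first", "second"],
    "baz": [
        [{"target_y_n": 0, "value": 50.0, "count": 1000},
         {"target_y_n": 1, "value": 60.0, "count": 200}],
        [{"target_y_n": 0, "value": 55.0, "count": 3000},
         {"target_y_n": 1, "value": 65.0, "count": 900}],
    ],
})

# One row per (cohort, bin): explode the lists, then spread the dict keys into columns.
exploded = nested.explode("baz").reset_index(drop=True)
bins = pd.DataFrame(exploded["baz"].tolist())
df = pd.concat([exploded[["cohort"]], bins], axis=1)
```

Note that df['count'] has to be accessed with brackets, since df.count is a DataFrame method.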

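import matplotlib.pyplot as plt
import seaborn as sns

# Bar heights are the pre-aggregated counts; ci=None turns off the error bars
# (newer seaborn versions spell this errorbar=None).
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)

sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)

# Plain matplotlib: overlay the two target groups on the same axes.
plt.bar(df[df['target_y_n']==0]['value'], df[df['target_y_n']==0]['count'], width=1)
plt.bar(df[df['target_y_n']==1]['value'], df[df['target_y_n']==1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])

sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)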

Finally, have a look at seaborn's FacetGrid class to extend your comparison.

g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)

In your case you would have something like:

g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
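
Because cohorts and target groups can differ a lot in size, raw counts make the panels hard to compare. A rough sketch of one way around this, assuming the long-format columns used above plus a hypothetical cohort column: turn the counts into within-group shares and a running share (a weighted empirical CDF). The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data with the columns used in this answer.
df = pd.DataFrame({
    "cohort":     ["a", "a", "a", "a", "b", "b", "b", "b"],
    "target_y_n": [0, 0, 1, 1, 0, 0, 1, 1],
    "value":      [50.0, 60.0, 50.0, 60.0, 50.0, 60.0, 50.0, 60.0],
    "count":      [300, 700, 10, 30, 4000, 6000, 500, 500],
})

group_cols = ["cohort", "target_y_n"]
df = df.sort_values(group_cols + ["value"]).reset_index(drop=True)

# Share of each bin within its (cohort, target) group, so groups of
# different sizes end up on the same scale.
df["share"] = df["count"] / df.groupby(group_cols)["count"].transform("sum")

# Running share within each group, i.e. a weighted empirical CDF.
df["cum_share"] = df.groupby(group_cols)["share"].cumsum()
```

Plotting share instead of count in the FacetGrid keeps the panels on one scale. Alternatively, the cum_share curves can be overlaid in a single axes, for example with sns.lineplot(data=df, x='value', y='cum_share', hue='cohort', style='target_y_n'), which tends to stay readable for more cohorts than overlaid bars do.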

And a qqplot option:

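import matplotlib.pyplot as plt
from scipy import stats

def qqplot(x, y, **kwargs):
    # Ordered values of each sample; fit=False skips the least-squares line fit.
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)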

g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
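
On how a QQ plot could work with compressed data: one possible approach, sketched here under assumptions rather than as a definitive method, is to compute weighted quantiles directly from the (value, count) pairs of each target group and then plot the quantiles of target 0 against those of target 1. The helper weighted_quantiles is hypothetical (not part of numpy or scipy) and uses the inverted-CDF definition of a quantile; the numbers are invented.

```python
import numpy as np

def weighted_quantiles(values, counts, probs):
    """Quantiles of the sample obtained by repeating each value `count` times,
    computed without expanding it (inverted-CDF definition)."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cdf = np.cumsum(counts) / counts.sum()
    # First bin whose cumulative share reaches each requested probability.
    idx = np.searchsorted(cdf, probs, side="left")
    return values[np.clip(idx, 0, len(values) - 1)]

probs = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
# Hypothetical compressed distributions for target 0 and target 1 in one cohort.
q0 = weighted_quantiles([50, 60, 70], [100, 300, 100], probs)
q1 = weighted_quantiles([50, 60, 70], [10, 20, 70], probs)
```

A scatter of q0 against q1 (for example plt.scatter(q0, q1) with a diagonal reference line) then gives a QQ plot of the two target groups for one cohort; points away from the diagonal indicate where the distributions differ.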

But that would only work for a single cohort i.e. I would have `n` such plots for each cohort. — Georg Heiler, Apr 09 '19 at 12:21
@GeorgHeiler also, you can simply discriminate according to `cohort` instead of `target` but if you plot `n` different distributions in one plot it would probably be very messy — CAPSLOCK, Apr 09 '19 at 12:30
