How can I easily compare the distributions of multiple cohorts?

Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.

It was created as:

SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender

where compress_distributionUDF simply takes a list of tuples and returns the counts per group.

This leaves me with a list of

Row(distribution_value=60.0, count=314251, target_y_n=0)

nested inside a pandas.Series, with one such list per cohort.

Basically, it is similar to:

pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})

and I wonder how to compare the distributions:

  • within a cohort (target_y_n 0 vs. 1)
  • across multiple cohorts

in a way that is still visually understandable rather than a mess.

edit

For a single cohort, "Plotting pre aggregated data in python" could be the answer. But how can multiple cohorts be compared without simply looping, which produces too many plots to compare?

Georg Heiler
  • Could you add another dataframe which represents your expected output. – Erfan Apr 08 '19 at 20:45
  • No the idea is not to have another dataframe but rather a visual comparison of the distributions. I.e. Similar to https://stackoverflow.com/questions/46045750/python-distplot-with-multiple-distributions but with the compressed distributions and not the raw values – Georg Heiler Apr 09 '19 at 04:21
  • would a qqplot do? You basically want to check the distributional difference between the groups: `target==0` and `target==1` and for each group you have the value and the count per value, correct? – CAPSLOCK Apr 09 '19 at 09:53
  • Indeed I have these values and would need to check. – Georg Heiler Apr 09 '19 at 09:57
  • Would seaborn have a simple way to achieve this? – Georg Heiler Apr 09 '19 at 09:59
  • Also, how would QQPlots work in case of this compressed data? – Georg Heiler Apr 09 '19 at 10:54
  • your `distribution_value` goes from 0 to 100? is it a percentile? and `count` is the count for that percentile or is a cumulative count? – CAPSLOCK Apr 09 '19 at 11:19
  • No, it is not a percentile; it is a real-valued (double) metric. Most values are between 50 and 70 though. – Georg Heiler Apr 09 '19 at 11:21

1 Answer

I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume they identify cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
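
The snippets that follow assume a flat `df` with one row per (cohort, target_y_n, value). A minimal sketch of how such a frame might be built from the nested `baz` column, assuming each cell holds a list of dicts (Spark `Row` objects could first be converted with `Row.asDict()`) and assuming a hypothetical `cohort` label made by joining `foo` and `bar`:

```python
import pandas as pd

# Toy data in the spirit of the question; the numbers are made up.
raw = pd.DataFrame({
    'foo': [1, 2],
    'bar': ['first', 'second'],
    'baz': [
        [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
         {'target_y_n': 1, 'value': 1.0, 'count': 10000}],
        [{'target_y_n': 0, 'value': 0.5, 'count': 300},
         {'target_y_n': 1, 'value': 1.0, 'count': 700}],
    ],
})

# One row per nested entry, then spread each dict into its own columns.
exploded = raw.explode('baz').reset_index(drop=True)
fields = pd.json_normalize(exploded['baz'].tolist())
df = pd.concat([exploded.drop(columns='baz'), fields], axis=1)

# Assumption: a cohort is identified by the (foo, bar) pair.
df['cohort'] = df['foo'].astype(str) + '_' + df['bar']
```

The result has the columns `foo`, `bar`, `target_y_n`, `value`, `count` and `cohort`.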

import seaborn as sns
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)


sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)


import matplotlib.pyplot as plt

# Overlay one bar series per target class; `count` is the pre-aggregated bar height.
for target, group in df.groupby('target_y_n'):
    plt.bar(group['value'], group['count'], width=1, label=f'Target={target}')
plt.legend()


sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)


Finally, have a look at the FacetGrid class to extend your comparison.

g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)


In your case you would have something like:

g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
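
When cohorts differ a lot in size, raw counts are hard to compare across facets. One possible approach, sketched here under the assumption of a long-format `df` with `cohort`, `target_y_n`, `value` and `count` columns (the data is made up), is to turn the counts into within-group shares and cumulative shares, i.e. a weighted ECDF, which can be overlaid as lines in a single plot:

```python
import pandas as pd

# Hypothetical long-format data: one row per (cohort, target_y_n, value).
df = pd.DataFrame({
    'cohort':     ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'target_y_n': [0, 0, 1, 1, 0, 0, 1, 1],
    'value':      [50.0, 60.0, 50.0, 60.0, 50.0, 60.0, 50.0, 60.0],
    'count':      [100, 300, 10, 10, 2000, 2000, 30, 10],
})

keys = ['cohort', 'target_y_n']

# Sort by value inside each group so the cumulative sum runs left to right.
ecdf = df.sort_values(keys + ['value']).copy()

# Share of the group's total count that falls on each value.
ecdf['share'] = ecdf['count'] / ecdf.groupby(keys)['count'].transform('sum')

# Cumulative share: the weighted ECDF evaluated at each value.
ecdf['cum_share'] = ecdf.groupby(keys)['share'].cumsum()
```

The resulting frame could then be drawn with something like `sns.lineplot(data=ecdf, x='value', y='cum_share', hue='cohort', style='target_y_n')`, which keeps every cohort on one set of axes.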


And a qqplot option:

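import matplotlib.pyplot as plt
from scipy import stats

def qqplot(x, y, **kwargs):
    # With fit=False, probplot returns (theoretical quantiles, ordered values).
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)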

g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
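
For count-compressed data, a QQ comparison of target 0 against target 1 can also be derived from the (value, count) pairs directly, without expanding them back into raw observations. A minimal sketch, assuming made-up data for one cohort: `weighted_quantiles` is a hypothetical helper (not a SciPy or seaborn function), and the midpoint interpolation it uses is only one of several quantile conventions.

```python
import numpy as np

def weighted_quantiles(values, counts, probs):
    """Approximate quantiles of a distribution given as (value, count) pairs."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    # Cumulative probability at the midpoint of each value's mass.
    cum = (np.cumsum(counts) - 0.5 * counts) / counts.sum()
    # Linear interpolation between the distinct values.
    return np.interp(probs, cum, values)

probs = np.linspace(0.05, 0.95, 19)

# Hypothetical compressed distributions for one cohort.
vals = [50, 55, 60, 65, 70]
q0 = weighted_quantiles(vals, [10, 40, 100, 40, 10], probs)  # target_y_n == 0
q1 = weighted_quantiles(vals, [5, 20, 60, 80, 35], probs)    # target_y_n == 1
```

Plotting `q0` against `q1` (for example with `plt.scatter(q0, q1)`) together with the diagonal gives a QQ plot: points off the diagonal indicate where the two distributions differ.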


CAPSLOCK
  • But that would only work for a single cohort i.e. I would have `n` such plots for each cohort. – Georg Heiler Apr 09 '19 at 12:21
  • @GeorgHeiler I gave you 7 options. Which one is "that"? – CAPSLOCK Apr 09 '19 at 12:28
  • @GeorgHeiler also, you can simply discriminate according to `cohort` instead of `target` but if you plot `n` different distributions in one plot it would probably be very messy – CAPSLOCK Apr 09 '19 at 12:30