0

I'm trying to figure out how to work with pre-aggregated data in pandas/matplotlib. I'm extracting my data from Kibana/ElasticSearch, so it's not raw data it's already been aggregated into buckets.

Some example data looks like this (actual data has many more categories and buckets that go up to 40).

Category,Bucket,Count
A,0,134563
B,0,215777
C,0,149918
A,1,183394
B,1,430333
C,1,234846
A,2,301137
B,2,604825
C,2,369665
A,3,385299
B,3,638058
C,3,471866

I realized since the data is already aggregated, I can't use any distribution plots, but I can plot the above data in a generic bar chart to see the distribution. That works.

What I want to do now is pull out stats like mean/median (per Category) and other stats from describe(), and also plot these on a boxplot.

How can I "de-aggregate" my data or otherwise transform it back to raw data so that I can more naturally work with it?

I got a hint from Pandas get median/average of pre-aggregated data about using np.repeat() to expand my counts back to raw data. My counts are way too high for that but I figure I can divide by 10 or 100 to get a reasonable approximation.

So I think I understand what I want to do, I just can't make np/pandas pull this off.

np.repeat(df['Bucket'], df['Count'] / 10).describe()

count    411961.000000
mean          1.914108
std           1.023361
min           0.000000
25%           1.000000
50%           2.000000
75%           3.000000
max           3.000000

# Think that's working?  But now how do I break it down by Category?
byCat = df.groupby('Category')
np.repeat(byCat['Bucket'], byCat['Count'] / 10).describe()

TypeError: unsupported operand type(s) for /: 'SeriesGroupBy' and 'int'

ack
  • 14,285
  • 22
  • 55
  • 73
  • 2
    Do you really have to disaggregate the data? All of those values can be calculated by considering the number of observations as the weights. There's a whole post on how to calculate weighted percentiles: https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy and the statsmodels library has an entire part dedicated to weighted statistics https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html. Sure you can `repeat` for smaller datasets, but it becomes infeasible for larger data. – ALollz Jan 12 '22 at 20:04
  • Does this answer your question? [Matplotlib boxplot using precalculated (summary) statistics](https://stackoverflow.com/questions/23655798/matplotlib-boxplot-using-precalculated-summary-statistics) – Michael Delgado Jan 12 '22 at 20:11
  • Maybe `pd.pivot_table('Count', index='Bucket', columns='Category')` is helpful? It is hard to get means if you don't have any values to start from. – JohanC Jan 12 '22 at 20:16

1 Answers1

0

You could group by Category and then calculate the statistics for each:

import pandas as pd
import numpy as np

for cat, df_cat in df.groupby('Category'):
    print(f'\nCategory: {cat}')
    print(np.repeat(df_cat['Bucket'], df_cat['Count'] / 10).describe())

Output:

Category: A
count    100437.000000
mean          1.933072
std           1.047681
min           0.000000
25%           1.000000
50%           2.000000
75%           3.000000
max           3.000000
Name: Bucket, dtype: float64

Category: B
count    188897.000000
mean          1.881512
std           1.004221
min           0.000000
25%           1.000000
50%           2.000000
75%           3.000000
max           3.000000
Name: Bucket, dtype: float64

Category: C
count    122627.000000
mean          1.948788
std           1.030864
min           0.000000
25%           1.000000
50%           2.000000
75%           3.000000
max           3.000000
Name: Bucket, dtype: float64
JohanC
  • 71,591
  • 8
  • 33
  • 66