I'm trying to figure out how to work with pre-aggregated data in pandas/matplotlib. I'm extracting my data from Kibana/ElasticSearch, so it's not raw data; it has already been aggregated into buckets.
Some example data looks like this (the actual data has many more categories, and the buckets go up to 40):
Category,Bucket,Count
A,0,134563
B,0,215777
C,0,149918
A,1,183394
B,1,430333
C,1,234846
A,2,301137
B,2,604825
C,2,369665
A,3,385299
B,3,638058
C,3,471866
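For anyone who wants to reproduce this, here is a minimal sketch of loading the sample above into a DataFrame. It assumes the rows are available as CSV text; the names `csv_text` and `df` are only for illustration (the rest of the question refers to the frame as `df`).

```python
import io

import pandas as pd

# The sample rows from above, pasted as CSV text (illustrative only;
# the real export has more categories and buckets up to 40).
csv_text = """Category,Bucket,Count
A,0,134563
B,0,215777
C,0,149918
A,1,183394
B,1,430333
C,1,234846
A,2,301137
B,2,604825
C,2,369665
A,3,385299
B,3,638058
C,3,471866
"""

# One row per (Category, Bucket) pair, with Count as the number of
# raw observations that fell into that bucket.
df = pd.read_csv(io.StringIO(csv_text))

# Total observations per category, as a quick sanity check.
print(df.groupby("Category")["Count"].sum())
```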
I realized since the data is already aggregated, I can't use any distribution plots, but I can plot the above data in a generic bar chart to see the distribution. That works.
What I want to do now is pull out stats like mean/median (per Category) and other stats from describe(), and also plot these on a boxplot.
How can I "de-aggregate" my data or otherwise transform it back to raw data so that I can more naturally work with it?
I got a hint from "Pandas get median/average of pre-aggregated data" about using np.repeat() to expand my counts back to raw data. My counts are way too high for that, but I figure I can divide by 10 or 100 to get a reasonable approximation.
So I think I understand what I want to do; I just can't make np/pandas pull this off.
np.repeat(df['Bucket'], df['Count'] / 10).describe()
count 411961.000000
mean 1.914108
std 1.023361
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 3.000000
# Think that's working? But now how do I break it down by Category?
byCat = df.groupby('Category')
np.repeat(byCat['Bucket'], byCat['Count'] / 10).describe()
TypeError: unsupported operand type(s) for /: 'SeriesGroupBy' and 'int'
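One possible way around this, offered as a sketch rather than a definitive answer: instead of passing the groupby object to np.repeat(), expand the rows first by repeating the index, and only then group by Category. The `scale` factor of 10 mirrors the "divide by 10" idea above and is an assumption; `expanded`, `stats`, and `exact_mean` are illustrative names. The last part computes the mean directly from the counts without any expansion, as a cross-check on the approximation.

```python
import io

import pandas as pd

# Sample data from the question, rebuilt here so the sketch is self-contained.
csv_text = """Category,Bucket,Count
A,0,134563
B,0,215777
C,0,149918
A,1,183394
B,1,430333
C,1,234846
A,2,301137
B,2,604825
C,2,369665
A,3,385299
B,3,638058
C,3,471866
"""
df = pd.read_csv(io.StringIO(csv_text))

# Assumed down-scaling factor to keep the expanded frame small.
scale = 10
repeats = df["Count"] // scale  # integer repeat count per row

# Repeat each row's index label `repeats` times, then select those rows.
# This keeps Category and Bucket together, which np.repeat on a single
# column does not.
expanded = df.loc[df.index.repeat(repeats), ["Category", "Bucket"]]
expanded = expanded.reset_index(drop=True)

# Per-category summary statistics on the approximate "raw" data.
stats = expanded.groupby("Category")["Bucket"].describe()
print(stats)

# Cross-check: exact weighted mean per category, computed from the
# original counts with no expansion or scaling.
weighted_sum = (df["Bucket"] * df["Count"]).groupby(df["Category"]).sum()
totals = df.groupby("Category")["Count"].sum()
exact_mean = weighted_sum / totals
print(exact_mean)
```

With matplotlib installed, `expanded.boxplot(column="Bucket", by="Category")` should then draw one box per category from the same expanded frame.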