I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on.
Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say something like
`{1: 10000, 5: 100000, 6: 73452}`
I need to do some basic statistics here, say the mean and variance. There are two obvious answers here:
- Unroll the map into a massive array
  `[1, 1, 1, 1, ..., 5, 5, 5, ..., 6, 6, 6, ...]`
  and call `np.mean`, `np.var`, and so on.
- Write the statistics by hand.
These methods have pros and cons.
The first option has the advantage of simplicity, and it'll fairly obviously work; but the time and memory costs are significant (in my use case, this will often take a map of size 1,000 and turn it into a list of size >10,000,000).
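For concreteness, the unrolling approach would look something like this rough sketch (using the toy map above):

```python
import numpy as np

data = {1: 10_000, 5: 100_000, 6: 73_452}

# Unroll: one array entry per observation, so the array has
# sum(multiplicities) elements (~183k here, >10M in my real data).
unrolled = np.repeat(
    np.fromiter(data.keys(), dtype=float),
    np.fromiter(data.values(), dtype=int),
)

print(unrolled.mean(), unrolled.var())
```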
The second option is fairly easy because the formulas can easily be looked up, but it's a little uncomfortable not being able to use library methods. I could typo something, miss a special case, etc. Generally I prefer to use libraries when they're available.
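The hand-written version, staying in (value, multiplicity) space, would be something along these lines (using the population variance, ddof=0, to match `np.var`'s default):

```python
import numpy as np

data = {1: 10_000, 5: 100_000, 6: 73_452}

values = np.fromiter(data.keys(), dtype=float)
counts = np.fromiter(data.values(), dtype=float)
n = counts.sum()

# Weighted mean and (population) variance, without ever unrolling.
mean = (values * counts).sum() / n
var = (counts * (values - mean) ** 2).sum() / n

print(mean, var)
```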
Is there a way within the numpy/scipy stack to do statistics on sets of values with multiplicities?