I'm using python and I've pulled in numpy/scipy as dependencies. It's OK to pull in more if they're well-tested and so on.
Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say something like
`{1: 10000, 5: 100000, 6: 73452}`
I need to do some basic statistics here, say the mean and variance. There are two obvious answers here:
- Unroll the map into a massive array
  `[1, 1, 1, 1, ..., 5, 5, 5, ..., 6, 6, 6, ...]`
  and call `np.mean`, `np.var`, and so on.
- Write the statistics by hand.
These methods have pros and cons.
The first option has the advantage of simplicity, and it'll fairly obviously work; but the time and memory costs are significant (in my use case, this will often take a map of size 1,000 and turn it into a list of size >10,000,000).
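For concreteness, the unrolling approach would look something like this rough sketch (using the toy map above):

```python
import numpy as np

data = {1: 10_000, 5: 100_000, 6: 73_452}

# Unroll: one array entry per observation, so the array has
# sum(multiplicities) elements (~183k here, >10M in my real data).
unrolled = np.repeat(
    np.fromiter(data.keys(), dtype=float),
    np.fromiter(data.values(), dtype=int),
)

print(unrolled.mean(), unrolled.var())
```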
The second option is fairly easy because the formulas can easily be looked up, but it's a little uncomfortable not being able to use library methods. I could typo something, miss a special case, etc. Generally I prefer to use libraries when they're available.
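The hand-written version, staying in (value, multiplicity) space, would be something along these lines (using the population variance, ddof=0, to match `np.var`'s default):

```python
import numpy as np

data = {1: 10_000, 5: 100_000, 6: 73_452}

values = np.fromiter(data.keys(), dtype=float)
counts = np.fromiter(data.values(), dtype=float)
n = counts.sum()

# Weighted mean and (population) variance, without ever unrolling.
mean = (values * counts).sum() / n
var = (counts * (values - mean) ** 2).sum() / n

print(mean, var)
```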
Is there a way within the numpy/scipy stack to do statistics on sets of values with multiplicities?