I'm using Python and I've pulled in NumPy/SciPy as dependencies. It's OK to pull in more if they're well-tested and so on.

Suppose I've got a dataset with a relatively small number of distinct values, each of which has a high multiplicity. I'll represent it as a map (Value -> Multiplicity), say something like

{ 1: 10000, 5: 100000, 6: 73452 }

I need to do some basic statistics on this data, say the mean and variance. There are two obvious approaches:

  1. Unroll the map into a massive array `[1, 1, 1, ..., 5, 5, 5, ..., 6, 6, 6, ...]` and call `np.mean`, `np.var`, and so on (sketched just after this list).
  2. Write the statistics by hand
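
For concreteness, option 1 would look something like this minimal sketch (the `counts` name is mine; the data is the example map from above, and `np.repeat` does the unrolling):

```python
import numpy as np

counts = {1: 10000, 5: 100000, 6: 73452}

# Unroll {value: multiplicity} into one flat array (183,452 elements here),
# then hand it to the stock routines.
data = np.repeat(list(counts.keys()), list(counts.values()))
mean = np.mean(data)
var = np.var(data)
```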

These methods have pros and cons.

  1. has the advantage of simplicity, and it'll fairly obviously work; but the time and memory costs are significant (in my use case, this will often take a map of size 1000 and turn it into a list of size >10,000,000).

  2. is fairly easy because the formulas can easily be looked up, but it's a little uncomfortable not to be able to use library methods: I could typo something, miss a special case, and so on. Generally I prefer to use libraries when they're available.
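For comparison, option 2 would be something like the following sketch of the textbook weighted formulas (hand-rolled, so subject to exactly the typo risk I describe above):

```python
import numpy as np

counts = {1: 10000, 5: 100000, 6: 73452}
values = np.fromiter(counts.keys(), dtype=float)
weights = np.fromiter(counts.values(), dtype=float)
n = weights.sum()

# Weighted mean and population variance (ddof=0, which is
# what np.var computes by default).
mean = (weights * values).sum() / n
var = (weights * (values - mean) ** 2).sum() / n
```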

Is there a way within the NumPy/SciPy stack to do statistics on sets of values with multiplicities?

Richard Rast

  • For the weighted mean there is [`np.average`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html#numpy.average) – Paul Panzer Feb 27 '18 at 16:23
  • In [an answer to a different question](https://stackoverflow.com/questions/47440513/fit-normal-distribution-to-weighted-list/47444047#47444047), I show an implementation of the weighted variance; see the line beginning with `In [64]`. But this is just a version of your option 2. – Warren Weckesser Feb 27 '18 at 16:43
  • https://stackoverflow.com/a/36464881/9334376 – kuppern87 Feb 27 '18 at 17:31
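
Following up on Paul Panzer's comment, the weighted-mean half is covered directly by the library; a minimal sketch (the variance half is what Warren Weckesser's linked answer hand-rolls, i.e. a version of option 2):

```python
import numpy as np

counts = {1: 10000, 5: 100000, 6: 73452}

# np.average takes a weights argument, so the weighted mean
# at least needs no unrolling.
mean = np.average(list(counts.keys()), weights=list(counts.values()))
```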
