9

I'm profiling some numeric time measurements that cluster extremely closely. I would like to obtain mean, standard deviation, etc. Some inputs are large, so I thought I could avoid creating lists of millions of numbers and instead use Python collections.Counter objects as a compact representation.

Example: one of my small inputs yields a collections.Counter like [(48, 4082), (49, 1146)], which means 4,082 occurrences of the value 48 and 1,146 occurrences of the value 49. For this data set I manually calculate the mean to be approximately 48.2192042846.

Of course if I had a simple list of 4,082 + 1,146 = 5,228 integers I would just feed it to numpy.mean().

My question: how can I calculate descriptive statistics from the values in a collections.Counter object just as if I had a list of numbers? Do I have to create the full list or is there a shortcut?
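
For concreteness, here is a small sketch of what I do today with the example above (the expansion to a full list is exactly what I'd like to avoid):

from collections import Counter
import numpy as np

counts = Counter({48: 4082, 49: 1146})

# Expand to a plain list of 5,228 integers and hand it to numpy.
values = list(counts.elements())
print(np.mean(values))  # ~48.2192042846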

chrisinmtown
  • (don't have time to write answer myself, but `np.average` has a weights parameter, and you can do stddev manually, see [here](http://stackoverflow.com/questions/2413522/weighted-standard-deviation-in-numpy) -- if anyone wants to write up an answer using that approach I'll delete this) – DSM Nov 13 '15 at 18:52
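
A minimal sketch of the weighted-numpy approach suggested in the comment above, using the example data from the question; the weighted standard deviation follows the formula from the linked answer:

import numpy as np

values = np.array([48, 49], dtype=float)
weights = np.array([4082, 1146], dtype=float)

mean = np.average(values, weights=weights)                    # weighted mean, ~48.2192
variance = np.average((values - mean) ** 2, weights=weights)  # weighted (population) variance
std_dev = np.sqrt(variance)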

3 Answers

9

collections.Counter() is a subclass of dict. Just use Counter().values() to get the counts, and you can use the standard-library statistics.mean() function:

import statistics
from collections import Counter

counts = Counter(some_iterable_to_be_counted)
mean = statistics.mean(counts.values())

Note that I did not call Counter.most_common() here, which would produce the list of (key, count) tuples you posted in your question.

If you must use the output of Counter.most_common() you can filter out just the counts with a generator expression:

mean = statistics.mean(count for key, count in most_common_list)

If you meant to calculate the mean key value as weighted by their counts, you'd do your own calculations directly from the counter values:

mean = sum(key * count for key, count in counter.items()) / counter.total()

Note: I used Counter.total() there, which is new in Python 3.10. In older versions, use sum(counter.values()).
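
A version-tolerant sketch of that weighted mean, using the example data from the question:

from collections import Counter

counter = Counter({48: 4082, 49: 1146})

try:
    total = counter.total()        # Python 3.10+
except AttributeError:
    total = sum(counter.values())  # older versions

mean = sum(key * count for key, count in counter.items()) / total
print(mean)  # ~48.2192042846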

For the median, use statistics.median():

import statistics
from collections import Counter

counts = Counter(some_iterable_to_be_counted)
median = statistics.median(counts.elements())

or, for the median of the key * count products:

median = statistics.median(key * count for key, count in counts.items())
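
With the example data from the question, the elements()-based version gives:

from collections import Counter
import statistics

counts = Counter({48: 4082, 49: 1146})
print(statistics.median(counts.elements()))  # 48.0 -- the middle of the 5,228 expanded values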
Martijn Pieters
6

While you can offload everything to numpy after making a list of values, this will be slower than needed. Instead, you can use the actual definitions of what you need.

The mean is just the sum of all numbers divided by their count, so that's very simple:

sum_of_numbers = sum(number*count for number, count in counter.items())
count = sum(count for n, count in counter.items())
mean = sum_of_numbers / count

Standard deviation is a bit more complex. It's the square root of variance, and variance in turn is defined as "mean of squares minus the square of the mean" for your collection. Soooo...

import math

total_squares = sum(number * number * count for number, count in counter.items())
mean_of_squares = total_squares / count
variance = mean_of_squares - mean * mean
std_dev = math.sqrt(variance)

A little more manual work, but it should also be much faster if the numbers have a lot of repetition.
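
A quick sanity check of those formulas against numpy, using the example data from the question (the expansion is done only for the comparison):

import math
from collections import Counter
import numpy as np

counter = Counter({48: 4082, 49: 1146})

count = sum(counter.values())
mean = sum(n * c for n, c in counter.items()) / count
mean_of_squares = sum(n * n * c for n, c in counter.items()) / count
std_dev = math.sqrt(mean_of_squares - mean * mean)

expanded = np.fromiter(counter.elements(), dtype=float)
assert math.isclose(mean, np.mean(expanded))
assert math.isclose(std_dev, np.std(expanded))  # np.std defaults to the population formula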

David Beauchemin
Jakub Wasilewski
  • Like this one for brevity. Thanks @Martijn Pieters for clarifying Python's integer & float math features, please don't be mad that I'm accepting this answer :) – chrisinmtown Nov 13 '15 at 19:37
  • You could still use numpy and it might be faster: `array = np.array(list(C.items())); mean = np.sum( array[:, 0] * array[:, 1]) / np.sum(array[:, 1])` – JLT May 19 '21 at 13:15
-1

Unless you want to write your own statistics functions, there is no prêt-à-porter solution (as far as I know).

So in the end you need to create the list, and the fastest way is to use numpy. One way to do it is:

import numpy as np

# One memory allocation will be considerably faster
# if you have multiple discrete values.
elements = np.ones(4082 + 1146)
elements[0:4082] *= 48
elements[4082:] *= 49

# Then you can use numpy statistical functions to calculate
np.mean(elements)
np.std(elements)
# ... 

UPDATE: Create elements from an existing collections.Counter() object

import collections

c = collections.Counter({48: 4082, 49: 1146})
elements = np.ones(sum(c.values()))
idx = 0
for value, occurrences in c.items():
    elements[idx:idx + occurrences] *= value
    idx += occurrences
chrisinmtown
Kon Pal