
While working on some statistical analysis tools, I discovered there are at least three Python methods to calculate the mean and standard deviation (not counting the "roll your own" techniques):

  • `np.mean()`, `np.std()` (with `ddof=0` or `ddof=1`)
  • `statistics.mean()`, `statistics.pstdev()` (and/or `statistics.stdev()`)
  • the `scipy.stats` module

That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance of `np.mean()` vs `statistics.mean()`, and it also highlights differences in the sum operator. That post is here: [Why is statistics.mean() so slow?](https://stackoverflow.com/a/37533799/10462884)

I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.

It appears the `statistics` module is primarily for those who have data in lists (or other non-array forms), or whose values span widely varying magnitudes (e.g., [1e+5, 1.0, 1e-5]). Is that still a fair statement? Are there any NumPy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
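
For reference, here is a minimal sketch of the three call styles on the same data; `scipy.stats.describe()` is used as one representative entry point into that module, and the -1.0-to-1.0 data is just an illustrative stand-in for my normalized values. (For what it's worth, `np.sum` uses pairwise summation on float arrays, which keeps its rounding error small, while `statistics.mean` accumulates exact fractions internally, trading speed for accuracy.)

import statistics
import numpy as np
from scipy import stats

data = np.random.uniform(-1.0, 1.0, 1000)  # small-range values, as in my application

# NumPy: np.std() defaults to the population std (ddof=0); ddof=1 gives the sample std
np_mean, np_std = np.mean(data), np.std(data, ddof=1)

# statistics: accepts any iterable; pstdev() is the population std, stdev() the sample std
st_mean, st_std = statistics.mean(data), statistics.stdev(data)

# scipy.stats: describe() reports the sample variance (ddof=1)
d = stats.describe(data)

print(np_mean, st_mean, d.mean)
print(np_std, st_std, d.variance ** 0.5)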

Numerical algorithms generally have positive and negative aspects: some are faster, some are more accurate, and some require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for their application. Generally this is a balancing act between competing priorities and resources.

My intent is to solicit replies from programmers experienced in statistical analysis who can provide insights into the strengths and weaknesses of the methods above (or other, better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.

    `statistics` doesn't require a heavy, external dependency. If you're already using `numpy`, I can't think of any reason to use `statistics` off the top of my head. – roganjosh Jan 03 '19 at 20:56
  • Related / possible duplicate of [What are the advantages of NumPy over regular Python lists?](https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists) - Many of the arguments are the same here. – jpp Jan 03 '19 at 20:58
  • 1
    Here's a related good [post](https://stackoverflow.com/a/37533799/2285236). – ayhan Jan 03 '19 at 21:03
  • @jpp, thanks. You don't have to convince me (about numpy vs lists) I'm already a numpy guy. I abandoned lists months ago for most of my work. – kcw78 Jan 03 '19 at 22:01
  • @ayhan, I found that post. I was curious if anything's been done to improve the accuracy of the np sum function (which apparently is why statistics.mean is slower, but more accurate in some cases). Guess a 'little testing' is in order -- 100 data sets with 1,000 data points. :-) – kcw78 Jan 03 '19 at 22:01
  • There is also the `scipy.stats` package. – hpaulj Jan 03 '19 at 22:05
  • For something as simple, conceptually, as `mean`, there are bound to be multiple ways of doing it. Plain Python has `sum`. There's also `math.fsum`. Maybe you have a list of `Decimal`s. Or other numeric objects that can't be converted into a numeric `ndarray`. Or you have an `ndarray` with `nan` values that need to be ignored. Or the data is in a `pandas` dataframe. – hpaulj Jan 03 '19 at 22:21
  • @hpaulj, I need mean and standard deviation, and I need to do it on 100s of datasets w/ 1000s of points. The data is already in a np array, so it's a snap to call either function. I better not get `nan` (if so, there are way bigger problems). Eventually I also need to apply a mask to the array before doing the calculations (still have to learn how to do that; see the sketch after these comments). – kcw78 Jan 03 '19 at 22:31
  • FWIW my advice is to go ahead and use numpy since you're already using that. Numpy has lots and lots of users, so bugs in basic functions such as mean and sd are likely to have been found and squashed already. (Maybe browse the numpy bug reports if you are interested.) Also, numpy functions are probably optimized for numpy arrays which I gather you are using. – Robert Dodier Jan 03 '19 at 23:45
  • If you have datasets with missing values (`nan`) or nodata values, there is a whole load of equivalent nan-functions (e.g. `np.nanmean()`), also illustrated in the sketch below. – NaN Jan 03 '19 at 23:55
  • Thanks for all the feedback. As @Robert pointed out, `np.mean` and `np.std` are the most convenient (since I'm already using numpy). I was looking for advice on the statistics side to confirm that using them (in lieu of the statistics module) is not an "error of convenience". My first set of tests shows differences in the 4th significant digit. I can live with that. – kcw78 Jan 04 '19 at 02:38
  • I would advise taking a look at https://pypi.org/project/accupy/ – Severin Pappadeux Jan 05 '19 at 02:13
  • @Severin Pappadeux: good info. Any idea where the `statistics` and `scipy` modules fall in the accuracy vs speed comparisons? My values are normalized (-1.0 to 1.0), so I expect the np.sum operator will be well behaved. – kcw78 Jan 06 '19 at 21:05
  • `statistics.median` is 10x faster for me in my current use case. – Liam McIntyre Aug 19 '20 at 04:42
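
A minimal sketch of the masking and NaN idioms discussed in the comments above (the 0.5 cutoff is just a hypothetical criterion):

import numpy as np

data = np.random.uniform(-1.0, 1.0, 1000)

# Boolean-mask selection: keep only the values of interest, then reduce
keep = np.abs(data) < 0.5
m1, s1 = data[keep].mean(), data[keep].std(ddof=1)

# numpy.ma masked arrays: the mask marks values to *exclude*
ma = np.ma.masked_where(np.abs(data) >= 0.5, data)
m2, s2 = ma.mean(), ma.std(ddof=1)

# NaN-aware reductions skip NaN entries entirely
data_nan = data.copy()
data_nan[::100] = np.nan
m3, s3 = np.nanmean(data_nan), np.nanstd(data_nan, ddof=1)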

2 Answers


Why does NumPy duplicate features of SciPy?

From the SciPy FAQ What is the difference between NumPy and SciPy?:

In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.

It recommends using SciPy over NumPy:

In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.

When should I use the statistics library?

From the statistics library documentation:

The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

Thus I would not use it for serious (i.e. resource-intensive) computation.

What is the difference between statsmodels and SciPy?

From the statsmodels about page:

The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.

Thus you may have a requirement that SciPy is not able to fulfill, or that is better fulfilled by a dedicated library. For example, the SciPy documentation for `scipy.stats.probplot` notes that

Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
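
As a concrete illustration of that overlap, here is a minimal sketch of the two APIs side by side (assuming Matplotlib is installed for the statsmodels plot; the variable names are mine):

import numpy as np
import scipy.stats
import statsmodels.api as sm

x = np.random.normal(size=1000)

# SciPy: probplot() returns the ordered quantile pairs plus a least-squares fit
(osm, osr), (slope, intercept, r) = scipy.stats.probplot(x, dist="norm")

# statsmodels: ProbPlot is an object exposing several plot methods
pp = sm.ProbPlot(x)
fig = pp.qqplot(line="45")  # see also pp.ppplot() and pp.probplot()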

– rlchqrd

Just a simple benchmark.

statistics vs NumPy

import statistics as stat
import numpy as np
from timeit import default_timer as timer

l = list(range(1_000_000))

# statistics: pure Python, works on any iterable
start = timer()
m, std = stat.mean(l), stat.stdev(l)  # stdev() is the sample std (ddof=1)
end = timer()
print(end - start, "sec elapsed.")
print("mean, std:", m, std)

# NumPy: converts the list to an array, then reduces in compiled code
start = timer()
m, std = np.mean(l), np.std(l)  # np.std() defaults to the population std (ddof=0)
end = timer()
print(end - start, "sec elapsed.")
print("mean, std:", m, std)

(Single runs only, for brevity; no repeated timing iterations.)

Result

NumPy is way faster.

# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441

# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
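
Note that the two std values differ in the seventh significant digit because `stat.stdev()` computes the sample standard deviation (ddof=1) while `np.std()` defaults to the population value (ddof=0); it is not an accuracy problem at this scale. Also, `np.mean(l)` and `np.std(l)` are timed on a Python list, so the NumPy measurement includes the list-to-array conversion. A sketch that times a pre-built array instead, to isolate the computation itself:

import numpy as np
from timeit import default_timer as timer

a = np.asarray(range(1_000_000), dtype=np.float64)  # conversion done outside the timed region

start = timer()
m, std = np.mean(a), np.std(a, ddof=1)  # ddof=1 matches stat.stdev()
end = timer()
print(end - start, "sec elapsed.")
print("mean, std:", m, std)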
– starriet
  • Yes, I knew numpy is faster. You tested with well conditioned numbers. Read this answer for [Why is statistics.mean() slow?](https://stackoverflow.com/a/37533799/10462884) TL;DR: it has to do with floats of wildly differing magnitude. – kcw78 Jun 29 '22 at 13:30
  • @kcw78 Oh, I didn't think you, the OP, would see this ;) This is just a simple comparison to show the magnitude of the speed difference, in case it helps anyone. Thanks for the comment and the link ;D – starriet Jun 29 '22 at 14:40