While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
- `np.mean()`, `np.std()` (with `ddof=0` or `ddof=1`)
- `statistics.mean()`, `statistics.pstdev()` (and/or `statistics.stdev()`)
- the `scipy.stats` package
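For reference, here is a minimal sketch (the sample values are arbitrary) showing how the defaults differ across the three options; `scipy.stats.describe()` stands in for the scipy route:

```python
import statistics

import numpy as np
from scipy import stats

data = [0.5, -0.25, 0.75, 0.1, -0.6]   # arbitrary sample values
arr = np.array(data)

# numpy: ddof=0 (population) is the default; pass ddof=1 for the sample estimate
print(np.mean(arr), np.std(arr), np.std(arr, ddof=1))

# statistics: pstdev() is the population std dev, stdev() the sample std dev
print(statistics.mean(data), statistics.pstdev(data), statistics.stdev(data))

# scipy.stats.describe() reports the sample variance (ddof=1 by default)
res = stats.describe(arr)
print(res.mean, res.variance ** 0.5)
```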
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of `np.mean()` vs. `statistics.mean()`, and it also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
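That comparison is easy to rerun on your own data with a quick `timeit` sketch like the one below (the array size and repeat count are arbitrary):

```python
import timeit

setup = """
import statistics
import numpy as np
data = list(np.random.uniform(-1.0, 1.0, 10_000))
arr = np.array(data)
"""

# statistics.mean() iterates in pure Python, converting values exactly along
# the way, while np.mean() dispatches to a compiled reduction over the array.
print("statistics.mean:", timeit.timeit("statistics.mean(data)", setup=setup, number=100))
print("np.mean:        ", timeit.timeit("np.mean(arr)", setup=setup, number=100))
```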
I am working with `numpy` array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the `numpy` functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
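As a sanity check on the accuracy side, a rough sketch like this (the uniform sample is purely illustrative) suggests the two implementations agree to within a few ULPs on well-scaled data:

```python
import statistics

import numpy as np

rng = np.random.default_rng(0)
arr = rng.uniform(-1.0, 1.0, 100_000)   # well-scaled values, like the ranges above

np_mean = np.mean(arr)
stat_mean = statistics.mean(arr.tolist())

# For data of similar magnitude the two results differ only in the last bits,
# so speed and convenience become the deciding factors.
print(np_mean, stat_mean, abs(np_mean - stat_mean))
```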
It appears the `statistics` module is primarily for those who have data in lists (or other non-array forms), or whose values span widely varying magnitudes (e.g. [1e+5, 1.0, 1e-5]). Is that still a fair statement? Are there any `numpy` enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
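To make the "sum operator" difference concrete, here is a contrived sketch (the values are chosen purely to provoke cancellation, not to represent real data):

```python
import math
import statistics

import numpy as np

# One huge value, many small ones, then the huge value cancelled out again:
# naive left-to-right float accumulation silently drops every 1.0.
vals = [1e16] + [1.0] * 1000 + [-1e16]
arr = np.array(vals)

print(sum(vals) / len(vals))        # built-in sum: naive accumulation, loses the 1.0s
print(math.fsum(vals) / len(vals))  # math.fsum: exactly rounded float summation
print(np.mean(arr))                 # numpy: pairwise summation, far better than naive sum
print(statistics.mean(vals))        # statistics: exact fraction arithmetic before dividing
```

As I understand it, numpy's float reductions use pairwise summation, which keeps the error growth down without the cost of `math.fsum`, but it is not exact the way the `statistics` module's internal summation is.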
Numerical algorithms generally have positive and negative aspects: some are faster, some more accurate, and some require a smaller memory footprint. When faced with 3-4 ways to do a calculation, it is the developer's responsibility to select the "best" method for their application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit insights from programmers experienced in statistical analysis into the strengths and weaknesses of the methods above (or of other, better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.