How do I identify an n-sigma event in a sample?

Question

This question borders on a mathematics question but the reason I'm asking it here is because I want a solution using boost. Please let me know if you think this would be better suited to the SE Maths.

I have a sample of error values from a set of arbitrary algorithms:

std::vector<double> errors {/* some values */};

Assuming a normal distribution of the values in errors, I need an algorithm that tells me the floating point value below which any number constitutes at least an n-sigma event. Using the 68–95–99.7 rule, if n were 2 then I would want to know the number below which there is at most a 5% chance of the number existing in the dataset.

double getSigmaEventValue(const std::vector<double>& container, int n);

Now, I have a suspicion that this problem is already solved for me in the boost accumulator library but I lack the mathsy know-how to figure out exactly what I'm looking for.

I know I can get the variance using boost::accumulators::variance, but I'm not aware of any wizardry I can employ to convert a variance to an n-sigma value, so that might not be the best approach. I'm interested in using boost because I already perform a set of time-critical statistics on this dataset (median, mean, variance, min and max) so it's likely that at least some of the calculations required for this will already have been cached.

http://stackoverflow.com/questions/5565228/quantile-functions-in-boost-c — David Heffernan, Feb 28 '15 at 07:52
I don't think you've got a solid grasp of what this library is doing if you hope it's going to cache stuff for you. It's highly unlikely that the trivial calculation of these stats is your perf bottleneck. Acquiring the data will surely cost more. Finally, writing code without a good understanding of the stats is a bad move. Slow down a little. Step back. Understand fully what you are doing. You'll get there sooner if you do that. — David Heffernan, Feb 28 '15 at 07:55
I suppose, you need the standard deviation: `sigma = sqrt(accumulators::moment<2>(acc))` — Nikerboker, Feb 28 '15 at 07:58
@DavidHeffernan I shouldn't have mentioned the performance thing, it's really a passing thought. I read through the `boost::accumulator` docs and it mentioned caching values from prior requests, so I thought it would be neat to reuse that data (I had a look at the boost code for the `median` implementation and it uses the `mean` header so presumably that part of the median calculation is "free" if you've already requested the mean previously). — quant, Feb 28 '15 at 07:59
@DavidHeffernan I think the quantile stuff is what I'm after; I'll take a look, thanks! — quant, Feb 28 '15 at 08:02
@quant Did you read the link I gave you? The answer is there. I trust you already know the relationship between stddev and var. OK, our comments crossed. Yes, quantiles of the normal dist are what you need. — David Heffernan, Feb 28 '15 at 08:02
That said, I wonder why you believe that your errors are always normal. You'd be wise to make some qq plots to check. Use a good interactive tool like R to help. — David Heffernan, Feb 28 '15 at 08:04
@DavidHeffernan I don't, but I'm happy to assume they are for the sake of guesstimating the sigma event. A normal distribution is one of the assumptions for this, right? — quant, Feb 28 '15 at 08:06
@quant Your numbers will mean nothing if the data is not normal. You'll just be pretending to give information. Anyone using it will draw the wrong conclusions. — David Heffernan, Feb 28 '15 at 08:09

score 1 · Accepted Answer · edited May 23 '17 at 12:28

If your data is normally distributed then calculate the sample mean and sample variance. This defines your fitted distribution. Then calculate quantiles for that distribution. For instance, this question covers that topic from the perspective of Boost: Quantile functions in boost (C++)

Of course, if your data is not normally distributed, and you apparently have no reason to believe it is, then any of your proposed calculations will be meaningless.
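
As a rough illustration of the fit-then-quantile approach, here is a minimal sketch using only the C++ standard library rather than Boost. It assumes the data really is normal, uses the n-1 (sample) denominator for the variance, and treats an "n-sigma event" as a value in the lower tail, so the threshold is simply mean minus n standard deviations. The helper names `sampleMean`, `sampleStdDev` and `lowerTailProbability` are hypothetical; only `getSigmaEventValue` comes from the question.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper: arithmetic mean of the sample.
double sampleMean(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Hypothetical helper: sample standard deviation.
// Assumption: n-1 denominator (unbiased variance estimate).
double sampleStdDev(const std::vector<double>& v) {
    assert(v.size() >= 2);
    const double m = sampleMean(v);
    double sumSquares = 0.0;
    for (double x : v) {
        sumSquares += (x - m) * (x - m);
    }
    return std::sqrt(sumSquares / (v.size() - 1));
}

// Threshold below which a value is at least an n-sigma event
// under the fitted normal distribution (lower tail only).
double getSigmaEventValue(const std::vector<double>& container, int n) {
    return sampleMean(container) - n * sampleStdDev(container);
}

// Probability that a normal variable falls more than n standard
// deviations below its mean, i.e. the standard normal CDF at -n.
double lowerTailProbability(int n) {
    return 0.5 * std::erfc(n / std::sqrt(2.0));
}
```

For n = 2 the lower-tail probability is roughly 2.3%, which is half of the two-sided 5% implied by the 68–95–99.7 rule. With Boost.Math, calling `quantile` on a `boost::math::normal` built from the same mean and standard deviation, at that tail probability, should yield the same threshold.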
