
I often process data with my programs; to keep things simple, let us say the data is a single series of numbers of the same magnitude. When the numbers are unreasonably large, it might be useful to normalize the data. One common transformation is subtracting the average from all values. After this transformation, the transformed data has an average of zero.

Another common transformation, which can be applied once the average is zero, is dividing the data by its standard deviation. After applying this transformation, the new data has unit variance.

When working with data normalized this way, I expect the numerical errors to be smaller. However, I seem to fail at these transformations, because numerical errors appear even when I merely try to compute the standard deviation.

Below is sample code in C# where I try to compute the standard deviation. Even without statistical knowledge (of the formula), it is easy to see that the output of the program should be zero: if the data is an array of one repeated constant, then the average of the squares of the data equals the square of the average.

static double standardDeviation(double[] data)
{
    double sum = 0;
    double sumOfSquares = 0;
    foreach (double number in data)
    {
        sum += number;
        sumOfSquares += number * number;
    }
    double average = sum / data.Length;
    double averageOfSquares = sumOfSquares / data.Length;
    return Math.Sqrt(averageOfSquares - average * average);
}
static void Main(string[] args)
{
    double bigNumber = 1478340000000;
    double[] data = Enumerable.Repeat(bigNumber, 83283).ToArray();
    Console.WriteLine(standardDeviation(data));
}

Instead of zero, the program outputs a large number caused by numerical errors: 2133383.0308878

Note that if I omitted Math.Sqrt (i.e. computed the variance instead of the standard deviation), the error would be much larger.

What is the cause, and how do I write this with smaller numerical errors?

O.Rerla

2 Answers


Although the formula you use for the variance is mathematically correct -- i.e. if you had infinite precision -- it can lead to trouble with finite precision. A double carries only about 15-16 significant decimal digits, and for your data averageOfSquares and average * average agree in almost all of them, so the subtraction cancels the leading digits and leaves little but rounding noise (this is known as catastrophic cancellation).
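
You can see the cancellation directly by printing the two nearly equal quantities from your code. A minimal sketch (the exact digits printed may vary):

static void Main(string[] args)
{
    double bigNumber = 1478340000000;
    double[] data = Enumerable.Repeat(bigNumber, 83283).ToArray();
    double sum = 0, sumOfSquares = 0;
    foreach (double x in data)
    {
        sum += x;
        sumOfSquares += x * x;
    }
    double average = sum / data.Length;
    // Both values are roughly 2.185e24; they agree in their leading
    // digits and differ only in the last few, which is all that the
    // subtraction averageOfSquares - average * average keeps.
    Console.WriteLine((sumOfSquares / data.Length).ToString("R"));
    Console.WriteLine((average * average).ToString("R"));
}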

A better way, for N data values X, is to compute

variance = Sum{ square(X[i] - mean) } / N

where

mean = Sum{ X[i] } / N
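
In C# this could look like the following sketch (the helper name standardDeviationTwoPass is just for illustration):

static double standardDeviationTwoPass(double[] data)
{
    // First pass: compute the mean.
    double sum = 0;
    foreach (double x in data)
        sum += x;
    double mean = sum / data.Length;

    // Second pass: average the squared deviations from the mean.
    double sumOfSquaredDeviations = 0;
    foreach (double x in data)
    {
        double d = x - mean;
        sumOfSquaredDeviations += d * d;
    }
    return Math.Sqrt(sumOfSquaredDeviations / data.Length);
}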

As written this requires two passes through the data. If that is awkward, you can in fact do it in a single pass. You need to keep three variables: n (the number of data items seen so far), mean and variance. These should all be initialised to 0 (i.e. 0.0 for the doubles). Then, when you get the next data item x:

n = n + 1
f = 1.0/n
d = x-mean
mean = mean + f*d
variance = (1.0-f)*(variance + f*d*d)

At each stage, after processing a data item, n, mean and variance are indeed the count, the mean and the (population) variance of the data seen so far.
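
A minimal C# sketch of this update rule (the name standardDeviationOnePass is only illustrative):

static double standardDeviationOnePass(double[] data)
{
    long n = 0;
    double mean = 0.0;
    double variance = 0.0;
    foreach (double x in data)
    {
        n += 1;
        double f = 1.0 / n;   // weight of the newest item
        double d = x - mean;  // deviation from the running mean
        mean += f * d;
        variance = (1.0 - f) * (variance + f * d * d);
    }
    return Math.Sqrt(variance);
}

With the constant data from the question this returns exactly 0: the first iteration sets mean to the first value and multiplies variance by (1.0 - f) = 0, and every later deviation d is then exactly 0.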

dmuir
  • Thanks for the great answer. I especially like how you do it in one pass. I understand that the mean is correct. However, the formula for the variance seems wrong to me. What about the following formula? `variance = (1.0-f)*variance + f*d*d*(1.0 + f)` That would make more sense to me. – O.Rerla Sep 04 '17 at 14:03
  • I'm pretty sure my formula is correct. Yours can't be: after the first data item, when f=1, your formula gives a non-zero value, 2*d*d, where d is the first data value (because the mean is initialised to 0). But the variance of a collection of one item is 0. – dmuir Sep 04 '17 at 15:28
  • Your counterexample is right, and so is your formula. I coded the program with your formula and it returns the expected values, so I marked your answer as accepted. However, I still don't see why it works. – O.Rerla Sep 04 '17 at 16:06
  • In my opinion you can still run into problems with huge values with this approach: the first time you calculate d * d ("mean" is still zero, so you have x * x) you are potentially causing a huge error. Later on this can also cause trouble if the mean varies wildly. The number of significant digits should be determined beforehand and the input data should be scaled accordingly to make sure your results are solid... – C. Gonzalez Sep 04 '17 at 16:15
  • C. Gonzalez: The variance in the first iteration is (1.0-f) * ..., which equals 0, because f is 1/1. – O.Rerla Sep 04 '17 at 16:25
  • You are right, I forgot that :-). Anyhow, with real data and varying averages you can still blow up the d * d term, no? – C. Gonzalez Sep 04 '17 at 16:42

I think you are confusing the maximum/minimum values a double can represent (±5.0 × 10^-324 to ±1.7 × 10^308) with the number of significant digits available (15-16).

In your case, I'd say that you are wasting digits by not scaling the input first, that is, transforming your value to 1.47834 with a scale factor of 1/10^12 for your numeric calculations.

C. Gonzalez
  • Scaling has nothing to do with it. It is done by the FP representation anyway and can be factored out or not. – Sep 05 '17 at 15:39