0

I'm trying to write a function that calculates the regression line from a data sheet with the least squares methode, but I encountered some serious problem with my code.

My first issue is that I don't know why my "linear regression" function is rounding the result of the iterations, even when I'm trying to use other "bigger" types.

My second issue is that the last part of my code is giving me the wrong results for the y intercept (b) and the slope (a), and I think that it could be a problem of conversion but I'm not really sure. If it is the case what should I do to avoid it?

void RegLin (const vector<double>& valuesX, const vector<double>& valuesY, vector<double>& PenOrd) {

unsigned int N=valuesX.size();

long double SomXi{0};
    for (unsigned i=0; i<N; ++i){
    SomXi+=valuesX.at(i);
    }

long double SomXiXi{0};             
    for (unsigned i=0; i<N; ++i){              //Here is a problem (number rounded) Expected value: 937352,25 / Given value: 937352
    SomXiXi+=(valuesX.at(i))*(valuesX.at(i));
    }

long double SomYi{0};
    for (unsigned i=0; i<N; ++i){
    SomYi+=valuesY.at(i);
    }

long double SomXiYi{0};
    for (unsigned i=0; i<N; ++i){              //Here is the same problem Excepted value: 334107,41 / Given value: 334107
    SomXiYi+=(valuesX.at(i))*(valuesY.at(i));
    }

long double a=(SomYi*SomXiXi-SomXi*SomXiYi)/(N*SomXiXi-pow(SomXi,2)); //Bad result

long double b=(N*SomXiYi-SomYi*SomXi)/(N*SomXiXi-pow(SomXi,2)); //Bad result

PenOrd.push_back(a);
PenOrd.push_back(b);

return;
}

Thank you in advance for your support

P.S: I'm using g++ and the 2011 C++ standard.

Beginner
  • 3
  • 4
  • 3
    The description "rounding the result" is meaningless. Describe what you mean (e.g. with an example input, expected output, and actual output). – Peter Apr 02 '17 at 15:19
  • I'm not quite sure what you are asking. But, taking a guess; maybe you'd want to read these (and remember, floating point math is not exact - it can't be): http://stackoverflow.com/questions/588004/is-floating-point-math-broken , https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Jesper Juhl Apr 02 '17 at 15:30
  • Your formulas for a and b don't look correct to me. Have a look at the correct formulas on Wikipedia: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line Where are you computing the averages for Xi and Yi? Here is a complete solution btw: http://stackoverflow.com/a/18974171/1291717 – Sergey Slepov Apr 02 '17 at 15:49
  • 1
    Possible duplicate of [Is floating point math broken?](http://stackoverflow.com/questions/588004/is-floating-point-math-broken) – Richard Critten Apr 02 '17 at 15:53
  • @SergeySlepov I'm using the least squares methode to calculate it, and I'm applying the formula that they gave me during the programming course, but maybe you are right, because when I see the formula on the paper it seems strange for me too... – Beginner Apr 02 '17 at 16:01
  • @Sergey: The average for Xi is `SomXi/N` using the notation in the question, and occurrences of `N` will cancel top and bottom. Have a look at the other answers at your link, and you'll see they line up very well with the code here. – Ben Voigt Apr 17 '17 at 01:38

3 Answers3

2

there are serveral points with your effort. I'm a theoretical physics and numerical maths guy. So let me share some best practices with you.

First, I never encountered the need for using long double. Stick with double, because if that would not suffice, then you should go about considering working log-log plots to further analyse your data.

Second, you are using unsigned int instead of int. You should never do regression work with that much values (i.e., value-pairs) that it isn't sufficient to use int or best std::size_t for your integer counters. Using too many values can degrade accuracy because of accumulating numerical roundoff issues. So do not use use more than say 10000 to 1 Million of values until you have a very good reason to do so.

Third, it quickly becomes necessary to not bluntly add your squares (e.g., for SumXiXi and so forth) but to sort your contributions to the sum before actually summing them up. You correctly sum them up starting with the tiniest of values proceeding with your ever growing contributions to your sums. That is the only way to stay on top of accumulating roundoff issues.

Fourth, control of results. A good sign of reliability of results is achievable, if you to your work twice, one time as you did go about (i.e., using x_iy_i - xy_i - x_iy + xy formuae) and then as a second approach using the still un-multiplicated (x_i - x)(y_i - y) formulae. Good quality calculations would yield very comparable results using either formulae.

So, maybe that was quite a detour about doing numerical regression work, hope it might help a little.

Regards, Micha

Micha
  • 181
  • 5
0

The first rule of numerical calculations with floating point says: "Work only with values of the same order".

Floating point mathematics works pretty simple in case of e.g. addition (float) :

1e6 + 1e-6 = 1000000 + 0.000001 = 1000000.000001 = 1000000 = 1e6
                                            ^
                                     precision limit

So, as you see, result is "rounded".

Alex
  • 9,891
  • 11
  • 53
  • 87
0

It's entirerly possible that you are given error due to 2 possibilites:

1) long double == double on your compiler and you are getting wrong results

2) floating point arithmetic does not represent values with 100% accuracy and therefore '0.10 != 0.10 written as float/double

Depending on what kind of calculations you are doing, I would adivse you to increase values by a few power OR change the data to float and store values in double.

johnyyonehand
  • 52
  • 1
  • 6