4

SShort:

GNUPlot gives a much better fit to my data than my GSL code does. Why?

Short:

I am slightly confused at the moment, so my question might not be particularly well worded... I will edit this as my understanding improves.

The original title of this question was: "g++ Compiling code with either -o1 -o2 or -o3 flags and floating point precision"

I believe that my code is suffering from numerical instabilities.

GNUPlot gives a much better fit to my data than my GSL code does, which is surprising since I believe GNUPlot also uses GSL libraries?

Long:

I have written some C / C++ code which makes use of the GNU Scientific Library (GSL). My code fits non-linear functions to sets of non-linear data. Algorithms which do this can be highly sensitive to the order in which floating point operations take place, due to the nature of numerical inaccuracies which result as an accumulation of numerical round-off errors.

Question: "Could these be caused by the effect of running with one of the optimization flags, -o1, -o2 or -o3?"

Partial Answer: I turned off any -oN flags and recompiled my code, I the results may have changed by some small amount, ie: delta_x / x ~= 1.0e-3. The fit is still poor in comparison to GNUPlot.

Functions I am fitting:

I have provided these to show you the numerical work which is occurring. I suspect that some of this is prone to numerical error.

Typical values for Yi would be in the range of 0.0 to 1.0. t is typically in the range 0.0 to 200.0. (But the fit is poor in the first half of that range.)

// Block A - Weighted Residuals

double t = time; // whatever that may be
double s = sigma[ix]; // these are the errors, tried 1.0 and 0.001 with no affect on parameter values obtained

double Yi = (std::sqrt(rho) * r0) / std::sqrt((rho - r0*r0) * std::exp(-2.0 * t / tau) + r0*r0); // y value - model

gsl_vector_set(f, ix, (Yi - y[ix])/sigma[ix]); // weighted residual

// Block B - Jacobian

double MEM_A = std::exp(-2.0 * t / tau); // Tried to do some optimization here
double MEM_B = 1.0 - MEM_A; // Perhaps this is causing a problem?
double MEM_C = rho - r0 * r0;
double MEM_D = MEM_C * MEM_A + r0*r0;
double MEM_E = std::pow(MEM_D, 1.5);

double df_drho = (std::pow(r0, 3.0) * MEM_B) / (2.0 * std::sqrt(rho) * MEM_E);
double df_dr0 = (std::pow(rho, 1.5) * MEM_A) / MEM_E;
double df_dtau = -1.0 * (std::sqrt(rho) * r0 * MEM_C * MEM_A * t) / (tau * tau * MEM_E);

gsl_matrix_set(J, ix, 0, df_drho / s);
gsl_matrix_set(J, ix, 1, df_dr0 / s);
gsl_matrix_set(J, ix, 2, df_dtau / s);

Here is a graph, isn't that nice?

Well here is a graph which explains the issue a lot better than I have been able to in words. You may ignore the green line, this just shows the initial parameters given before a fitting algorithm is run which changes those parameters.

GNUPlot Fit Results:

RHOFIT = 0.086173236829715 +- 2.61304934752193e-05
 R0FIT = 0.00395856812689133 +- 2.08898744280108e-05
TAUFIT = 11.7694359189233 +- 0.016094629240588

// Not sure how GNUPlot calculates errors - they do not appear to be the regular errors computed from the off diagonal elements of the LM matrix after fitting. (If you know a little about the Levenberg–Marquardt algorithm.)

C++ GSL Fit Results:

rho    = 0.08551510 +/- ...
r0     = 0.00507645 +/- ... // Nonsense errors due to "not-real" errors on data points
tau    = 12.99114719 +/- ...

On careful inspection, you will see that the Pink and Blue lines do not overlay each other by quite some considerable margin. The Pink line is what many would describe as a "good fit". The Blue line by comparison isn't particularly good.

I have tried making the errorbars (although they are the same size for all points - they aren't "real" errorbars, only artificial ones) smaller - this doesn't help, only changes the chi-square value and associated errors of each parameter after fitting.

Graph 1

Further Random Ideas:

  • I built GSL wrong?
  • Gnuplot splits the dataset up into small blocks to keep the numbers added together roughly the same order of magnitude? (Kind of like how FFT works.)

GSL Fit Output:

iter: 0 x = 0.1 0.001 10 |f(x)| = 12487.8
status = success
iter: 1 x = 0.0854247 0.00323946 13.2064 |f(x)| = 10476.9
dx vector: -0.0145753, 0.00223946, 3.20642
status = success
iter: 2 x = 0.0854309 0.00576809 13.7443 |f(x)| = 3670.4
dx vector: 6.18836e-06, 0.00252863, 0.537829
chisq/dof = 6746.03
rho    = 0.08543089 +/- 0.00013518
r0     = 0.00576809 +/- 0.00013165
tau    = 13.74425294 +/- 0.09012196
FreelanceConsultant
  • 13,167
  • 27
  • 115
  • 225
  • Could you try to rerun it with `std::pow(xxx, 1.5) -> x*sqrt(x)` and `std::pow(x, 3.0) -> x*x*x` ? – Severin Pappadeux Jul 14 '15 at 17:35
  • @SeverinPappadeux Hmm that actually appears to be worse! – FreelanceConsultant Jul 14 '15 at 18:12
  • Well, here is numerical instability for you – Severin Pappadeux Jul 14 '15 at 18:32
  • So long as your derivatives are correct (I think they are), you should be converging to a better solution than you're seeing. Are your iterations definitely converging? Try dumping out the dx vector at each iteration and check the model parameter adjustment steps are, indeed, getting successively smaller. Are you stopping prematurely? What convergence/stopping criteria are you using? I.e. are you using `gs_multifit_test_delta()`, `gsl_multifit_test_gradient()` or `gsl_multifit_gradient()` and with what parameters? – duncan Jul 14 '15 at 18:50
  • Hi @duncan - you obviously know a lot more about this than myself! I am using gsl_multifit_test_delta, with parameters gsl_multifit_test_delta(s->dx, s->x, 1e-1, 1e-1), which I think is the default form the examples? How will I know if I'm stopping prematurely? I will add the dx dumps to the question above in a second. – FreelanceConsultant Jul 14 '15 at 21:22
  • @duncan - Please see the fit output at the bottom – FreelanceConsultant Jul 14 '15 at 21:25
  • @user3728501 Completely unrelated; have you considered R? – CinchBlue Jul 14 '15 at 21:30
  • @VermillionAzure I know that's a statistics tool - what advantages will that offer? As far as I know it's a separate program like GNUPlot - I cannot integrate it with C++? – FreelanceConsultant Jul 14 '15 at 21:46
  • @user3728501 It's a very popular statistics programming language that has a very nice IDE, package management, and web-app creation out-of-the-box. I'm sure R would be much easier to program in, and better yet, it's easier to just pick up useful packages and go. – CinchBlue Jul 14 '15 at 21:51
  • @user3728501 The syntax is C-like but is vastly simplified. However, everything is just a breeze--it's also possible to integrate C and C++ into it if you so desire. – CinchBlue Jul 14 '15 at 21:52
  • @VermillionAzure Is there any documentation for rcpp - which I assume if what you are referring to? – FreelanceConsultant Jul 14 '15 at 21:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/83261/discussion-between-vermillionazure-and-user3728501). – CinchBlue Jul 14 '15 at 21:56
  • The `1e-1` in `gsl_multifit_test_delta(s->dx, s->x, 1e-1, 1e-1)` seems pretty big. It means that iteration will stop when each model parameter (rho, r0, tau) is adjusted by an amount either less than 0.1 in size or less than 10% of the parameter's value. E.g. if tau=13, then an adjustment (dx) in the value of tau of less than 1.3 is considered small enough to quit. Matters are complicated by tau being much bigger than the other two model parameters. Try cranking 1e-1 down to, say, 1e-4 (which is the default used in the example on http://www.gnu.org/software/gsl/manual/gsl-ref_38.html). – duncan Jul 15 '15 at 00:46

1 Answers1

2

I came across this page because I was having the exact same problem. I needed to fit a function with GSL, hadn't done it before so I was comparing the results to gnuplot's fitting routine. In my case, I was fitting a simple power law to a portion of the galaxy power spectrum, and GSL was giving me fits that had chi^2/DoF around 6.

In order to solve this, I figured out that I had been careless and the x values for my data points didn't match the x values where the fitting function was being evaluated. The easiest way to fix this was to create a spline from the data values, and then evaluate the spline at the same x values where the fitting function would be evaluated. For example:

#include <gsl/gsl_spline.h>
    .
    .
    .
std::vector< double > xvals;
std::vector< double > yvals;

fin.open("SomeDataFile.dat", std::ios::in);
while (!fin.eof()) {
    double x, y;
    fin >> x >> y;
    xvals.push_back(x);
    yvals.push_back(y);
}
fin.close();

gsl_spline *Y = gsl_spline_alloc(gsl_interp_cspline, yvals.size());
gsl_interp_accel *acc = gsl_interp_accel_alloc();

gsl_spline_init(Y, &xvals[0], &yvals[0], yvals.size());

double y[N];

for (int i = 0; i < N; ++i) {
    double x = xmin + i*dx; // Where xmin is the smallest x value and dx
                            // is (xmax-xmin)/N
    y[i] = gsl_spline_eval(Y, x, acc);
}

Then in my function where the difference between the power law and data was calculated, I made sure to use the same xmin and dx so that the x values were the same for the Yi function in your notation.

struct data {
    size_t n;
    double *y;
    double xmin;
    double xmax;
};

int powerLaw(const gsl_vector *x, void *dat, gsl_vector *f) {
    size_t n = ((data *) dat)->n;
    double *y = ((data *) dat)->y;
    double xmin = ((data *) dat)->xmin;
    double xmax = ((data *) dat)->xmax;
    double dx = (xmax-xmin)/double(n);

    double A = gsl_vector_get(x, 0);
    double alpha = gsl_vector_get(x, 1);

    for (int i = 0; i < n; ++i) {
        double xval = xmin + double(i)*dx;
        double Yi = A*pow(xval,alpha);
        gsl_vector_set(f, i, Yi - y[i]);
    }

    return GSL_SUCCESS;
}

After that, the values from gnuplot and GSL agree quite nicely, gnuplot giving an amplitude of 123.196 +/- 0.04484, and an exponent of -1.13275 +/- 0.001903 and GSL giving 123.20464 +/- 0.98008 and -1.13272 +/- 0.00707. The results of the fits are shown in the graph below, where Fit is from gnuplot, and g(x) is from GSL (Note: I don't expect the power laws to be a precise match to the data, but is sufficient for my purposes). The fits from gnuplot and GSL are virtually identical.

Plot of the data and fits from gnuplot and GSL.

I would have just mentioned this in a comment to your question, but as I have never had to ask a question here and have never answered one, I didn't have enough rep.

Dave P.
  • 121
  • 5
  • What are your limits for quitting the fitting? I had a 10% limit (see above) which caused the fit to be poor. Are you having the same issue or something different? – FreelanceConsultant Jul 13 '16 at 18:22
  • For me it was as stated in the answer. Initially my function powerLaw wasn't using the correct x values when doing the comparison. I actually have the tolerances for the gradient and step size set to 10^-16 currently, which is probably overkill. – Dave P. Jul 13 '16 at 18:43
  • Just realized an even easier way to deal with this is to simply add a pointer to x values in the struct data, then pass the exact x and y values to the powerLaw function. – Dave P. Jul 13 '16 at 18:53
  • With such a small value I'm surprised your code didn't enter an infinite loop. The smallest value you should consider using is probably 10e-8 – FreelanceConsultant Jul 13 '16 at 19:16
  • Yeah, the only reason it was set to that was to try to see if that was the source of the poor fit. That code was just a test to get used to doing the fitting with GSL (most of the time that I need to fit a model to data I use Markov Chain Monte Carlo). Before using it on something other than the highly idealized, very smooth data that was used for that test, I'll be writing the fitting routine into custom library and will definitely be using a larger tolerance as the real data will be much noisier. – Dave P. Jul 13 '16 at 19:41