I have recently encountered Gaussian process models and think they may be the solution to a problem I have been working on in my lab. I have an open, related question on Cross Validated, but I wanted to separate my modeling/math questions from my programming questions; hence this second, related post. If more background on my problem would help, here is the link to my open CV question.
Here is my Stan code, which corresponds to the updated covariance functions presented in my CV post.
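Written out directly from the code below (nothing here beyond what main_GP computes), the three kernels are

$$K_1(x,y)=\alpha_1\exp\left(-\frac{(x-y)^2}{2\rho_1^2}\right),$$
$$K_2(x,y)=\alpha_2\exp\left(-\frac{2\sin^2(\pi\,|x-y|\,\mathrm{HR}_f)}{\rho_2^2}\right)\exp\left(-\frac{(x-y)^2}{2\rho_3^2}\right),$$
$$K_3(x,y)=\alpha_3\exp\left(-\frac{2\sin^2(\pi\,|x-y|\,\mathrm{HR}_f)}{\rho_4^2}\right)\exp\left(-\frac{2\sin^2(\pi\,|x-y|\,\mathrm{R}_f)}{\rho_5^2}\right),$$

with the full covariance given by $\Sigma = K_1 + K_2 + K_3$. The Stan program: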
functions{
  //covariance function for the main portion of the model
  matrix main_GP(
    int Nx,
    vector x,
    int Ny,
    vector y,
    real alpha1,
    real alpha2,
    real alpha3,
    real rho1,
    real rho2,
    real rho3,
    real rho4,
    real rho5,
    real HR_f,
    real R_f){
    matrix[Nx, Ny] K1;
    matrix[Nx, Ny] K2;
    matrix[Nx, Ny] K3;
    matrix[Nx, Ny] Sigma;

    //squared-exponential kernel governing the main portion of the covariance
    for(i in 1:Nx){
      for(j in 1:Ny){
        K1[i, j] = alpha1*exp(-square(x[i]-y[j])/2/square(rho1));
      }
    }

    //kernel incorporating heart rate (periodic in HR_f, damped by a squared exponential)
    for(i in 1:Nx){
      for(j in 1:Ny){
        K2[i, j] = alpha2*exp(-2*square(sin(pi()*fabs(x[i]-y[j])*HR_f))/square(rho2))*
                   exp(-square(x[i]-y[j])/2/square(rho3));
      }
    }

    //kernel incorporating heart rate as a function of respiration (periodic in both HR_f and R_f)
    for(i in 1:Nx){
      for(j in 1:Ny){
        K3[i, j] = alpha3*exp(-2*square(sin(pi()*fabs(x[i]-y[j])*HR_f))/square(rho4))*
                   exp(-2*square(sin(pi()*fabs(x[i]-y[j])*R_f))/square(rho5));
      }
    }

    Sigma = K1 + K2 + K3;
    return Sigma;
  }
  //function for posterior predictive draws
  vector post_pred_rng(
    real a1,
    real a2,
    real a3,
    real r1,
    real r2,
    real r3,
    real r4,
    real r5,
    real HR,
    real R,
    real sn,
    int No,
    vector xo,
    int Np,
    vector xp,
    vector yobs){
    matrix[No, No] Ko;
    matrix[Np, Np] Kp;
    matrix[No, Np] Kop;
    matrix[Np, No] Ko_inv_t;
    vector[Np] mu_p;
    matrix[Np, Np] Tau;
    matrix[Np, Np] L2;
    vector[Np] z;
    vector[Np] yp;
    //--------------------------------------------------------------------
    //kernel for the observed data (all three GPs, noise variance sn on the diagonal)
    Ko = main_GP(No, xo, No, xo, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
    Ko = Ko + diag_matrix(rep_vector(1, No))*sn;
    //--------------------------------------------------------------------
    //kernel for the predicted data
    Kp = main_GP(Np, xp, Np, xp, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
    Kp = Kp + diag_matrix(rep_vector(1, Np))*sn;
    //--------------------------------------------------------------------
    //cross-kernel between observed and predicted points
    Kop = main_GP(No, xo, Np, xp, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
    //--------------------------------------------------------------------
    //Algorithm 2.1 of Rasmussen and Williams:
    //posterior mean Kop' * inv(Ko) * yobs, covariance Kp - Kop' * inv(Ko) * Kop,
    //then one draw from the resulting multivariate normal via its Cholesky factor
    Ko_inv_t = Kop'/Ko;
    mu_p = Ko_inv_t*yobs;
    Tau = Kp - Ko_inv_t*Kop;
    L2 = cholesky_decompose(Tau);
    //independent standard normal draws; note that rep_vector(normal_rng(0,1), Np)
    //would repeat a single draw Np times rather than draw Np values
    for(i in 1:Np) z[i] = normal_rng(0, 1);
    yp = mu_p + L2*z;
    return yp;
  }
}
data {
  int<lower=1> N1;
  int<lower=1> N2;
  vector[N1] X;
  vector[N1] Y;
  vector[N2] Xp;
  real<lower=0> mu_HR;
  real<lower=0> mu_R;
}
transformed data {
  vector[N1] mu;
  for(n in 1:N1) mu[n] = 0;
}
parameters {
  real loga1;
  real loga2;
  real loga3;
  real logr1;
  real logr2;
  real logr3;
  real logr4;
  real logr5;
  real<lower=.5, upper=3.5> HR;
  real<lower=.05, upper=.75> R;
  real logsigma_sq;
}
transformed parameters {
  real a1 = exp(loga1);
  real a2 = exp(loga2);
  real a3 = exp(loga3);
  real r1 = exp(logr1);
  real r2 = exp(logr2);
  real r3 = exp(logr3);
  real r4 = exp(logr4);
  real r5 = exp(logr5);
  real sigma_sq = exp(logsigma_sq);
}
model{
  matrix[N1, N1] Sigma;
  matrix[N1, N1] L_S;
  //using GP function from above
  Sigma = main_GP(N1, X, N1, X, a1, a2, a3, r1, r2, r3, r4, r5, HR, R);
  Sigma = Sigma + diag_matrix(rep_vector(1, N1))*sigma_sq;
  L_S = cholesky_decompose(Sigma);
  Y ~ multi_normal_cholesky(mu, L_S);
  //priors for parameters
  loga1 ~ student_t(3,0,1);
  loga2 ~ student_t(3,0,1);
  loga3 ~ student_t(3,0,1);
  logr1 ~ student_t(3,0,1);
  logr2 ~ student_t(3,0,1);
  logr3 ~ student_t(3,0,1);
  logr4 ~ student_t(3,0,1);
  logr5 ~ student_t(3,0,1);
  logsigma_sq ~ student_t(3,0,1);
  HR ~ normal(mu_HR, .25);
  R ~ normal(mu_R, .03);
}
generated quantities {
  vector[N2] Ypred;
  Ypred = post_pred_rng(a1, a2, a3, r1, r2, r3, r4, r5, HR, R, sigma_sq, N1, X, N2, Xp, Y);
}
I have tinkered with the priors for the kernel parameters; some parameterizations are a bit faster (up to twice as fast in some instances), but the chains can still be relatively slow even in those cases.
I am trying to predict values for 3.5 s of data (sampled at 10 Hz, so 35 data points), using the 15 seconds preceding and following the contaminated section (sampled at 3.33 Hz, so 100 data points in total).
The model is called from R as follows:
fit.pred2 <- stan(file = 'Fast_GP6_all.stan',
                  data = dat,
                  warmup = 1000,
                  iter = 1500,
                  refresh = 5,
                  chains = 3,
                  pars = pars.to.monitor
                  )
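In case it helps, dat is just a named list matching the data block above, and pars.to.monitor restricts what gets saved. Schematically it looks like the following; the object names x_obs, y_obs, x_pred and the mu_HR/mu_R values shown here are placeholders for illustration, not my exact script:

library(rstan)

# observed times and signal flanking the contaminated section:
# 15 s before + 15 s after at 3.33 Hz = 100 points (x_obs, y_obs),
# plus 35 prediction times at 10 Hz across the 3.5 s gap (x_pred)
dat <- list(N1    = length(x_obs),   # 100 observed points
            N2    = length(x_pred),  # 35 prediction points
            X     = x_obs,
            Y     = y_obs,
            Xp    = x_pred,
            mu_HR = 1.2,              # prior mean heart-rate frequency (Hz), placeholder
            mu_R  = 0.25)             # prior mean respiration frequency (Hz), placeholder

# only monitor the transformed parameters and the predictions
pars.to.monitor <- c("a1", "a2", "a3", "r1", "r2", "r3", "r4", "r5",
                     "HR", "R", "sigma_sq", "Ypred")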
To be honest, I do not know whether I need that many warmup iterations. I imagine part of the slow estimation is the result of fairly non-informative priors (except for the heart-rate and respiration parameters HR and R, which have fairly well-known ranges at rest in a healthy adult).
Any suggestions for speeding up my program's run time are more than welcome.
Thanks.