1

For an application I am building I need to run linear regression on large datasets in order to obtain residuals. For example, one dataset is more than 1 million x 20k in dimension. For the smaller datasets I was using fastLm from the RcppArmadillo package - which works great for those - currently. With time those datasets will also grow beyond 1 million rows.

My solution was to use sparse matrices and Eigen. I was unable to find a good example for using SparseQR in RcppEigen. Based on many hours of reading (e.g. rcpp-gallery, stackoverflow, rcpp-dev mailinglist, eigen docs, rcpp-gallery, stackoverflow and many more that I have forgotten but sure have read) I wrote the following piece of code;

(NB: my first c++ program - please be nice :) - any advice to improve is welcomed)

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
using namespace Rcpp;
using namespace Eigen;

using Eigen::Map;
using Eigen::SparseMatrix;
using Eigen::MappedSparseMatrix;
using Eigen::VectorXd;
using Eigen::SimplicialCholesky;


// [[Rcpp::export]]
List sparseLm_eigen(const SEXP Xr, 
                    const NumericVector yr){
  
  typedef SparseMatrix<double>        sp_mat;
  typedef MappedSparseMatrix<double>  sp_matM;
  typedef Map<VectorXd>               vecM;
  typedef SimplicialCholesky<sp_mat>  solver;

  const sp_mat Xt(Rcpp::as<sp_matM>(Xr).adjoint());
  const VectorXd Xty(Xt * Rcpp::as<vecM>(yr));
  const solver Ch(Xt * Xt.adjoint());
  
  if(Ch.info() != Eigen::Success) return "failed";
  
  return List::create(Named("betahat") = Ch.solve(Xty));
}

This works for example for;

library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")

data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)

sp_mm <- as(mm, "dgCMatrix")
y <- data1$y

res1 <- sparseLm_eigen(sp_mm, y)$betahat
res2 <- unname(coefficients(lm.fit(mm, y)))

abs(res1 - res2)

It fails however for my large datasets (as I kind of expected). My initial intention was to use the SparseQR as a solver but I don't know how to implement that.

So my question - can someone help me to implement QR decomposition for sparse matrices with RcppEigen?


Solution I Used

Disclaimer: I don't know if it's correct - it works for my particular problem and seems to deliver correct results.

As per @TomWenseleers comment to add my solution, here it is. I could not get Eigen's LeastSquareConjugateGradient to work. I think because as the documentation states A'A must be positive definitive. So instead of solving A = bx I am solving A'A = A'bx using Eigen's Conjugate Gradient with a diagonal preconditioner. Then using the coefficients to calculate the fitted values and the residuals.

#include <RcppEigen.h>
#include <Eigen/IterativeLinearSolvers>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::List sparse_cg(const Eigen::Map<Eigen::SparseMatrix<double> > &A,
                     const Eigen::Map<Eigen::VectorXd> &b,
                     const Eigen::Map<Eigen::VectorXd> &x0,
                     const int &maxiter,
                     const double &tol) {
    
  // Set-up the Conjugate Gradient solver
  Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                           Eigen::Lower|Eigen::Upper,
                           Eigen::DiagonalPreconditioner<double> > cg;
  // Initialize solver
  cg.setMaxIterations(maxiter);
  cg.setTolerance(tol);
  cg.compute(A);

  // Solve system with guess (i.e. old solutions)
  Eigen::VectorXd coef = cg.solveWithGuess(b, x0);

  // Solver convergence stats
  int iter = cg.iterations();
  double err = cg.error();

  return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
                            Rcpp::Named("itr") = iter,
                            Rcpp::Named("error") = err);
}
tstev
  • 607
  • 1
  • 10
  • 20
  • 2
    Sparse QR solver usage is detailed here: https://eigen.tuxfamily.org/dox/group__TopicSparseSystems.html – coatless Apr 12 '17 at 21:02
  • 1
    @coatless I have been through the Eigen documentation - in all honesty (since I have no formal mathematics or programming education) it's very hard to read and understand. I really like Armadillo documentation - but from what I understand it currently works best with dense matrices without SuperLU library. Installing SuperLU is currently out of the question. – tstev Apr 13 '17 at 09:00
  • Would some iterative solver like in https://github.com/kisungyou/Rlinsolve or https://github.com/styvon/cPCG be an option for this problem? – Tom Wenseleers Sep 03 '19 at 10:03
  • Or Eigen's iterative solver, https://eigen.tuxfamily.org/dox/classEigen_1_1LeastSquaresConjugateGradient.html ? SparseM::slm.fit and speedglm::speedlm.fit may also be worth testing - they were the fastest sparse linear solvers for my problems in available R packages (with Intel MKL installed, which you get if you install Microsoft Open R). – Tom Wenseleers Sep 03 '19 at 14:54
  • Hi Tom, thanks for the comments. Rlinsolve didnt work for my large datasets. Nor did speedglm or SparseM. Eventually I solved it with Eigen PCG solvers. Works like a charm within seconds. cPCG I didnt know about. Thanks will check that out. – tstev Sep 04 '19 at 12:09
  • @tstev Would you mind posting your code by any chance? Just in case it could be useful for someone else facing this problem... – Tom Wenseleers Mar 02 '23 at 16:28
  • 1
    @TomWenseleers i added my solution - full disclosure, it works for my particular problem - not sure it's a general good solution... – tstev Mar 04 '23 at 17:33

1 Answers1

4

How to write a sparse solver with Eigen is a bit generic. This is mainly because the sparse solver classes are designed superbly well. They provide a guide explaining their sparse solver classes. Since the question focuses on SparseQR, the documentation indicates that there are two parameters required to initialize the solver: SparseMatrix class type and OrderingMethods class that dictates the supported fill-reducing ordering method.

With this in mind, we can whip up the following:

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <Eigen/SparseQR>

// [[Rcpp::export]]
Rcpp::List sparseLm_eigen(const Eigen::MappedSparseMatrix<double> A, 
                          const Eigen::Map<Eigen::VectorXd> b){

  Eigen::SparseQR <Eigen::MappedSparseMatrix<double>, Eigen::COLAMDOrdering<int> > solver;
  solver.compute(A);
  if(solver.info() != Eigen::Success) {
    // decomposition failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  Eigen::VectorXd x = solver.solve(b);
  if(solver.info() != Eigen::Success) {
    // solving failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }

  return Rcpp::List::create(Rcpp::Named("status") = true,
                            Rcpp::Named("betahat") = x);
}

Note: Here we create a list that always passes a named status variable that should be checked first. This indicates whether convergence happens in two areas: decomposition and solving. If all checks out, then we pass the betahat coefficient.


Test Script:

library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")

data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)

sp_mm <- as(mm, "dgCMatrix")
y <- data1$y

res1 <- sparseLm_eigen(sp_mm, y)
if(res1$status != TRUE){
    stop("convergence issue")
}
res1_coef = res1$betahat
res2_coef <- unname(coefficients(lm.fit(mm, y)))

cbind(res1_coef, res2_coef)

Output:

        res1_coef    res2_coef
[1,]  1.027742926  1.027742926
[2,]  0.142334262  0.142334262
[3,]  0.044327457  0.044327457
[4,]  0.338274783  0.338274783
[5,] -0.001740012 -0.001740012
[6,]  0.046558506  0.046558506
coatless
  • 20,011
  • 13
  • 69
  • 84
  • 1
    Rcpp Gallery post? – Dirk Eddelbuettel Apr 13 '17 at 01:30
  • @coatless, awesome! I chose SparseQR exactly for the reasons meantioned on the page you linked, namely; any matrix kind and recommended for large least square problems. I am about to test it on the larger datasets. I will report back asap. – tstev Apr 13 '17 at 09:04
  • @coatless, one question pure out of curiosity as I have never programmed in c++; is there a reason why you don't use `using namespace Eigen` etc, or `typedef [whatever]` or `using Eigen::[blah]`? In many examples you see this. I figured it was to reduce amount of typing. Do you have different motivation besides preference? – tstev Apr 13 '17 at 09:08
  • 1
    Never ever fully load in a `namespace` (e.g. `using namespace Eigen;`) as multiple libraries may have the _same_ function name leading to ambiguity between function calls. Regarding dropping the `typedef`, this was to try to simplify matters by directly referring to `Eigen` classes instead of pseudo referring to a class by a highly customized user specific name. – coatless Apr 13 '17 at 13:57
  • @coatless - I have tested it on my full dataset (N = ~900,000, M = ~6000) and it is still running. Seems like this is not an option for me unfortunately. It was kind of my last hope :( thanks for everything though. I learned something new. – tstev Apr 14 '17 at 11:27