
I’m trying to use CVXR to solve a large constrained weighted least squares regression, but I am running into issues with the problem size. My data set has around 8 million observations and 500 variables. In addition, there are two types of constraints on the regression coefficients:

(1) Non-negativity constraints: most of the coefficients cannot be negative. For example, w1 has to be greater than or equal to 0.

(2) Hierarchical constraints: there are hierarchical relationships between many of the variables, and the coefficients have to reflect them. For example, w1 has to be greater than or equal to w2.

In total, there are 1500 constraints, of which 1200 are hierarchical constraints. The remaining 300 are non-negativity constraints.
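
To make the setup concrete, the problem described above can be written roughly as follows. The notation is my own shorthand and not taken from the code: v_i are the observation weights (v_i = 1 gives the unweighted objective used in the code below), x_i is row i of the predictor matrix, N is the set of the 300 coefficients that must be non-negative, and (h_j, l_j) are the index pairs of the 1200 hierarchical rules.

```latex
\begin{aligned}
\min_{w \in \mathbb{R}^{500}} \quad & \sum_{i=1}^{n} v_i \left( y_i - x_i^{\top} w \right)^2,
  \qquad n \approx 8 \times 10^{6} \\
\text{subject to} \quad & w_k \ge 0, \qquad k \in N, \quad |N| = 300, \\
& w_{h_j} \ge w_{l_j}, \qquad j = 1, \dots, 1200 .
\end{aligned}
```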

My code currently looks like this:

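library(Matrix)
library(CVXR)

# Column names of the predictor matrix, one per coefficient
Predictor_vars <- colnames(Predictor)
# One coefficient per column; coefficients are matched to names by position
Coefficients <- Variable(rows = ncol(Predictor))
Goal <- sum_squares(Y - (Predictor %*% Coefficients))

# build constraints
# The first 300 coefficients must be non-negative; the rest are unrestricted
Non_negative <- Diagonal(x = c(rep(1, 300), rep(0, 200))) %*% Coefficients >= 0
# One constraint per hierarchical rule: coefficient high[j] >= coefficient low[j]
Hierarchy <- vector("list", length(high))
for (j in seq_along(high)) {
  Hierarchy[[j]] <- Coefficients[which(high[j] == Predictor_vars)] >= Coefficients[which(low[j] == Predictor_vars)]
}

# Combine all constraints into a single list
p <- Problem(Minimize(Goal), constraints = c(list(Non_negative), Hierarchy))
Result <- solve(p)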

Variable Information:

Predictor is a large sparse matrix (class dgCMatrix from the Matrix package) holding the observations. Its dimensions are 8 million rows by 500 columns, and it takes up around 2 GB.

Y is the target vector with 8 million entries, one per observation.

high and low are two vectors of coefficient names, each of length 1200, that together describe the hierarchical constraints between pairs of coefficients. The coefficient named at index j of the high vector must be greater than or equal to the coefficient named at index j of the low vector (that is, the coefficient for high[j] >= the coefficient for low[j]).
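
To illustrate what high and low encode, here is a minimal toy sketch with made-up names and values (4 coefficients and 2 rules instead of 500 and 1200). It is only an illustration and not part of my actual code: it maps the names to column positions with match() and stacks the rules into a sparse matrix A, so that rule j holds exactly when entry j of A %*% w is non-negative.

```r
library(Matrix)

# Hypothetical toy data: 4 coefficients and 2 hierarchical rules.
Predictor_vars <- c("w1", "w2", "w3", "w4")
high <- c("w1", "w3")   # names of the coefficients that must be larger
low  <- c("w2", "w4")   # names of the coefficients that must be smaller

# Look up the column position of every name in one vectorised call.
hi_idx <- match(high, Predictor_vars)
lo_idx <- match(low,  Predictor_vars)

# Row j has +1 at high[j] and -1 at low[j], so (A %*% w)[j] = w_high - w_low.
n_rules <- length(high)
A <- sparseMatrix(
  i = rep(seq_len(n_rules), times = 2),
  j = c(hi_idx, lo_idx),
  x = rep(c(1, -1), each = n_rules),
  dims = c(n_rules, length(Predictor_vars))
)

# Check the rules against a made-up coefficient vector.
w <- c(2, 1, 0.5, 0.7)
satisfied <- as.vector(A %*% w) >= 0
print(satisfied)
```

With these made-up values the first rule (w1 >= w2) holds and the second (w3 >= w4) does not.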

Whether I include the constraints or try to solve the problem without any constraints (which would just be ordinary least squares), all solvers return errors: either "cholmod error (Problem too large)" or "cannot allocate vector of size 46142.2 Gb".

Any leads or suggestions regarding the following questions would be much appreciated!

(1) Is it possible to rewrite this to obtain a problem that CVXR can solve? If so, how could we go about it?

(2) Are there other ways of solving such a problem, for example with a better-suited R or Python implementation or a more efficient package?

camel_o