
I have a large dataset on which I've been trying to run a lasso regression. Categorical variables are recoded as dummies. After receiving several messages about limited memory, I converted my data into a sparse matrix using the Matrix package.

The issue is that my code has been running for a long time (several hours without completion), and I'm not sure why.

Here is a sample of 2000 rows of data (~0.3% of data) that produces the same issue: https://drive.google.com/file/d/1ZhyFIoxJSRHrC_eIe58C5zXFKJW-13Lm/view?usp=sharing

This is the code I've been using:

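    library(tidyverse)
    library(Matrix)
    install.packages('glmnet')  # one-time install; not needed on every run
    library(glmnet)
    # Explicitly load the packages glmnet depends on
    pacman::p_load(methods,utils,foreach,shape,survival,Rcpp,RcppEigen)

    # Convert the data frame to a sparse matrix (via a dense intermediate)
    data_sample_matrix <- as.matrix(data_sample) %>% Matrix(., sparse = TRUE)

    set.seed(879)

    # 80/20 train/test split by row
    split <- sample(nrow(data_sample_matrix), floor(0.8*nrow(data_sample_matrix)))
    
    train <- data_sample_matrix[split,]
    test <- data_sample_matrix[-split,]
    
    # Column 28 is the outcome; drop it from the predictors
    train_s <- train[,-28]
    test_s <- test[,-28]
    
    # Cross-validated lasso (alpha = 1) logistic regression
    cv_model <- cv.glmnet(train_s, train[,28], alpha = 1, family = "binomial", nlambda = 10, 
                          trace.it = TRUE)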

Note: I've explicitly loaded all the packages that glmnet depends on according to CRAN, because I noticed that they weren't being loaded when I ran library(glmnet).

Note: column 28 (`[,28]`) is my outcome variable.

Can anyone point to what I'm doing wrong?
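
For readers hitting the memory limit mentioned above: one possible way to avoid the dense intermediate created by `as.matrix()` is to build the sparse design matrix straight from a data frame that still holds its factor columns, using `sparse.model.matrix()` from Matrix. This is only a sketch under that assumption; the toy data frame `df` and its column names are hypothetical and are not the data from this question.

```r
library(Matrix)

# Hypothetical toy data frame standing in for the real data (illustration only).
df <- data.frame(
  outcome = c(0, 1, 0, 0),
  region  = factor(c("a", "b", "a", "c")),
  age     = c(31, 45, 27, 52)
)

# Build the sparse design matrix directly from the factors.
# `- 1` drops the intercept column; glmnet fits its own intercept.
x <- sparse.model.matrix(outcome ~ . - 1, data = df)
y <- df$outcome

class(x)  # a sparse Matrix class, no dense copy was created
dim(x)
```

If the data are already dummy-coded, as in the question, this step does not apply directly, but the resulting `x` and `y` can be passed to `cv.glmnet()` in the same way as `train_s` and `train[,28]`.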

thou
  • The main reason I think something is wrong is that the progress bar never loads. – thou Apr 19 '22 at 20:11
  • This might be helpful [https://stackoverflow.com/questions/17032264/big-matrix-to-run-glmnet](https://stackoverflow.com/questions/17032264/big-matrix-to-run-glmnet) – ML_Enthu Apr 19 '22 at 20:31
  • I appreciate the post! Are you suggesting I use biglasso? My data is already in sparse matrix format. It's worth pointing out that I have a 2000 x 1702 sparse matrix and am still having the same issue... – thou Apr 19 '22 at 20:40
  • Can you provide how you loaded the data? There may be a bug in the script you sent. `Error in source("dput_sample_data.txt") : dput_sample_data.txt:69819:7: unexpected ',' 69818: 0L)), row.names = c(NA, -2000L), class = "data.frame") 69819: 0L,` – James Yang Apr 20 '22 at 18:23
  • I'm also having trouble with your data. I deleted everything after row 69819 (looked like you duplicated some stuff by accident), but then I end up with a **very** weird data frame (with elements of different lengths). You could probably `dput(data_sample_matrix[1:2000,])` for a more compact data set (it would preserve sparsity ...) – Ben Bolker Apr 20 '22 at 18:33
  • Dear all, I am very sorry for the error on my part. I've re-uploaded the data as a csv file. – thou Apr 20 '22 at 19:20
  • Thanks for the csv. For me, the model is fit in reasonable time (each fold takes less than a second), but for some of the folds the CV training set contains so few observations from one of the classes that `cv.glmnet` fails. Indeed, `summary(as.factor(data_sample_matrix[,28]))` shows that only 4 data points are labeled `1`, which is too few for logistic regression to work well. – James Yang Apr 20 '22 at 20:27
  • https://rdrr.io/rforge/CrossValidate/man/balancedSplit.html ? – Ben Bolker Apr 20 '22 at 21:06
  • Thank you so much for the validation! Just to confirm, my code has no issues? I will mention when I try running the cv.glmnet, I get the error message "Error in get_int_parms(fdev = double(1), eps = double(1), big = double(1), : function 'Rcpp_precious_remove' not provided by package 'Rcpp'". I have tried uninstalling and reinstalling Rcpp and RcppEigen but the error message still pops up. If I run the same code after reinstalling the packages and after the first error message, RStudio seems to just stall. – thou Apr 20 '22 at 21:22
  • There are other questions on here referring to `Rcpp_precious_remove` https://stackoverflow.com/questions/68416435/rcpp-package-doesnt-include-rcpp-precious-remove , although they claim that updating/reinstalling should work. (1) What version of R? (2) Have you tried installing *from source*? – Ben Bolker Apr 20 '22 at 21:56
  • 1) Version 4.0.2; 2) I'm unsure what that means. I've done the install.packages() and have also used the point and click methods – thou Apr 20 '22 at 22:21
  • Okay, so I figured out that a lot of the issues I was facing were due to R being old. After updating R to the newest version, those issues went away. However, @JamesYang, to your point, I now get this error: "(error code -2); Convergence for 2th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned". I have ~1900 total outcomes out of 600K. Is this error due to too few outcomes? – thou Apr 21 '22 at 02:06
  • Yes. Your original data matrix contains only 4 data points labeled as 1, and that's too few for the fit to be accurate. – James Yang Apr 22 '22 at 03:20
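
The class-imbalance point raised in the comments can be checked, and partly worked around, by tabulating the outcome and assigning cross-validation folds within each class, so that every fold receives some of the rare class. The sketch below is not from the thread: it uses simulated data, and `make_stratified_folds` is a hypothetical helper. It relies only on the documented `foldid` argument of `cv.glmnet()`. It does not fix a sample with only 4 positives; it just keeps each fold from ending up with none when there are enough to go around.

```r
library(Matrix)
library(glmnet)

set.seed(879)

# Simulated stand-in for the real data (illustration only):
# 2000 rows, 50 sparse 0/1 dummy columns, and a rare binary outcome.
n <- 2000
p <- 50
x <- rsparsematrix(n, p, density = 0.05, rand.x = function(k) rep(1, k))
y <- rbinom(n, 1, 0.05)

# How many observations fall in each class?
table(y)

# Hypothetical helper: assign fold ids separately within each class,
# so each fold gets a roughly equal share of every class.
make_stratified_folds <- function(y, nfolds = 5) {
  foldid <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    foldid[idx] <- sample(rep_len(seq_len(nfolds), length(idx)))
  }
  foldid
}

foldid <- make_stratified_folds(y, nfolds = 5)

# Every fold should now contain observations from both classes.
table(foldid, y)

# Pass the fold assignment to cv.glmnet instead of letting it pick folds at random.
cv_model <- cv.glmnet(x, y, family = "binomial", alpha = 1, foldid = foldid)
```

With the real data, `x` and `y` would be `train_s` and `train[,28]` from the question.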

0 Answers