I have two columns with 10 and 6000 categories, respectively. I want to run a regression to understand their effects on the dependent variable. Is there a way to get a sparse matrix when creating the dummy variables in R, like `sparse=True` does in pandas (Python)?

  • https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r – user20650 Dec 03 '21 at 01:08
  • I find the methods through the `Matrix` package to be a bit cumbersome and slow for this. I prefer to use the package `sparseMatrixStats`. You install this with `BiocManager::install("sparseMatrixStats")`. Then you only have two arguments, the dataframe or matrix you're converting and the type of matrix - `sparse_mat <- as(denseMatrixObject, "dgCMatrix")`. It took less than a second for a 3136 column, 60,000 row object. – Kat Dec 03 '21 at 01:13
  • @user20650: I got that, but how do I use that matrix in a regression? Also, I should get 6010 columns, but I'm getting 6000 * 10 columns in my sparse matrix. Any idea why? – Pranjal Srivastava Dec 03 '21 at 05:49
  • @PranjalSrivastava; re the number of columns, did you add an interaction between the two columns, i.e. `~ A * B` instead of `~ A + B`? Re sparse, you could look at `glmnet`, which iirc supports sparse matrices (and for that number of columns you will likely want to regularise anyway) – user20650 Dec 03 '21 at 10:09
  • @user20650: I don't want an interaction between the dummy variables of these two columns. I only want to include their fixed effects (dummy variables), so I want 6000 + 10 columns only. Even if I create a sparse matrix using `sp <- sparseMatrix(i = i, j = j, x = x, index1 = FALSE)`, I get an error when I use it in `lm` to run the regression. – Pranjal Srivastava Dec 03 '21 at 17:50
  • @PranjalSrivastava; yes, I got that, but you must be getting interactions. Try just using `sparse.model.matrix(~.-1,data=df)`. You don't show what code you are running for the regression (`lm.fit` doesn't support sparse matrices) but see [here](https://stackoverflow.com/questions/3169371/large-scale-regression-in-r-with-a-sparse-feature-matrix) for suggestions. Again, a regularised fit with `glmnet` is a good call for this many columns. – user20650 Dec 03 '21 at 18:04
  • Please provide enough code so others can better understand or reproduce the problem. – Community Dec 07 '21 at 06:39
  • The question text here only asks *how* to make the matrix, making this a duplicate of [Directly creating dummy variable set in a sparse matrix in R](https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r). Asking how to *use* such a matrix is a separate question. – merv Jan 19 '22 at 01:44
  • @Kat It seems like your comment would make an excellent answer on the duplicate target – Ian Campbell Jan 19 '22 at 05:02
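
The suggestions in the comments above can be put together roughly as follows. This is only a minimal sketch, not code from the asker: the data frame `df`, its columns `A`, `B` and `y`, and the reduced level counts (10 and 50 instead of 10 and 6000) are hypothetical stand-ins. It assumes the `Matrix` package is installed, and the `glmnet` step is optional and only runs if that package is available.

```r
library(Matrix)

set.seed(1)
n <- 1000

# Hypothetical toy data: A has 10 levels, B has 50 (standing in for 6000).
lev_a <- sprintf("a%02d", 1:10)
lev_b <- sprintf("b%02d", 1:50)
df <- data.frame(
  A = factor(sample(lev_a, n, replace = TRUE), levels = lev_a),
  B = factor(sample(lev_b, n, replace = TRUE), levels = lev_b),
  y = rnorm(n)
)

# Main effects only (`+`): one indicator column per level of A, plus one per
# level of B except its reference level, i.e. 10 + (50 - 1) columns here.
X_main <- sparse.model.matrix(~ A + B - 1, data = df)

# With an interaction (`*`): columns for level combinations are added as well,
# which is how the column count grows towards (levels of A) * (levels of B).
X_int <- sparse.model.matrix(~ A * B - 1, data = df)

dim(X_main)
dim(X_int)

# lm() does not take a sparse design matrix; glmnet does. Ridge (alpha = 0)
# is used here purely as an example of a regularised fit.
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::cv.glmnet(X_main, df$y, alpha = 0)
  head(coef(fit, s = "lambda.min"))
}
```

Note that with the default treatment contrasts one level of the second factor is absorbed as the reference level, so for 10 and 6000 categories the main-effects matrix would be expected to have one column fewer than 6010.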

0 Answers