
I have mixed data (both quantitative and categorical) predicting a quantitative variable. I converted the categorical variables into factors before feeding them into a glm model in R. Most of these categorical variables have more than 150 levels. When I try to fit the glm model, it fails with memory issues because of the factors with so many levels. I could set a threshold and accept only variables with up to a certain number of levels, but I need to include these high-level factors in the model. Is there a methodology I can follow to address this issue?

Edit: The dataset has 120000 rows and 50 columns. When the data is expanded with model.matrix there are 4772 columns.

    Can you post the error message? It's not clear to me whether this is memory related or not. – Fernando Mar 10 '17 at 20:08
  • I tried without a threshold and the RStudio session got aborted. Then, when I added a threshold to reject variables with more levels (>150) from the model, it worked fine. – bob jones Mar 10 '17 at 20:12
  • If your data is sparse, using a sparse matrix may solve the problem (package `glmnet` for example). – Fernando Mar 10 '17 at 20:14
  • How many observations do you have? How many columns are in the predictor model matrix? Use something like `ncol(model.matrix(outcome ~ ., data = yourdataframe))`. What do you mean by "fails with memory issues": do you get a warning / error / message etc? – user20650 Mar 10 '17 at 20:27
  • The predictor model matrix has 4772 columns. I just mean that it is taking forever to fit the model. – bob jones Mar 10 '17 at 20:28
  • https://cran.r-project.org/web/packages/biglm/index.html might be useful. But what are you doing with that number of columns? I assume a prediction model, which leads to glmnet as suggested by Fernando. – user20650 Mar 10 '17 at 20:29
  • Yes, tried bigglm, but with no success. – bob jones Mar 10 '17 at 20:30
  • The thing is, the data has columns like zip code which have many levels. – bob jones Mar 10 '17 at 20:34
  • maybe something here http://stackoverflow.com/questions/3169371/large-scale-regression-in-r-with-a-sparse-feature-matrix – user20650 Mar 10 '17 at 20:35
  • @user20650 Correct me if I am wrong, but my data is not sparse; in fact each cell has a value. – bob jones Mar 10 '17 at 20:41
  • No, your data will be sparse: factors will be split into columns of ones and zeros in the model matrix. Quick example with a 20 by 5 data frame: `dat = data.frame(replicate(5, sample(paste0(1:100, letters), 20))) ; ncol(m <- model.matrix( ~ . , data=dat))`. Have a look at `m`. – user20650 Mar 10 '17 at 20:49
  • PS it's probably worth adding your `sessionInfo()` to your question, as well as the amount of installed memory and the number of rows in your dataset. – user20650 Mar 10 '17 at 20:52
  • @user20650 Is this similar to one-hot encoding? – bob jones Mar 10 '17 at 20:54
  • Well, I had to search for *one hot encoding*, but at a glance I would say yes. BUT regression techniques form a dummy matrix like this (R does it automatically in the routines). [This](http://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r#23042363) might help form the sparse matrix to input into, for example, glmnet; see the sketch after these comments. – user20650 Mar 10 '17 at 20:59
  • @user20650 It is failing with the error below. I will have to research it. Error in validObject(r) : invalid class “dgTMatrix” object: all column indices (slot 'j') must be between 0 and ncol-1 in a TsparseMatrix – bob jones Mar 10 '17 at 21:16
  • Bob, it's not a drop-in solution; *I think* you will need to amend it to apply only across your factor variables. Perhaps Ben's approach, from the same page, will be easier to use. – user20650 Mar 10 '17 at 21:21
  • Yes, I realized it should be applied only to the categorical variables, and your solution worked fine. But the issue is with the interpretation. I will have to work on associating these columns with the variables/levels of interest. – bob jones Mar 10 '17 at 21:27
  • maybe search web for "r grouped lasso" – user20650 Mar 10 '17 at 21:39
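
A minimal sketch of the sparse-matrix route discussed in the comments, assuming the `Matrix` and `glmnet` packages are available; the data frame and column names below are illustrative stand-ins, not the asker's actual variables:

```r
library(Matrix)
library(glmnet)

## Toy stand-in: a quantitative outcome, two numeric predictors,
## and two high-cardinality factors (zip-code-like variables).
set.seed(1)
df <- data.frame(
  y  = rnorm(1000),
  x1 = rnorm(1000),
  x2 = rnorm(1000),
  f1 = factor(sample(sprintf("zip%03d", 1:150), 1000, replace = TRUE)),
  f2 = factor(sample(sprintf("grp%03d", 1:200), 1000, replace = TRUE))
)

## Build the dummy-coded design matrix in sparse form,
## avoiding the dense matrix that model.matrix() would create.
X <- sparse.model.matrix(y ~ . - 1, data = df)

## glmnet accepts sparse matrices directly; cross-validated lasso here.
fit <- cv.glmnet(X, df$y, family = "gaussian")
coef(fit, s = "lambda.min")
```

The coefficients returned by `coef(fit, ...)` keep the usual `variableLEVEL` names from the model matrix, which gives one way to map them back to the original factors and levels, the interpretation concern raised above.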

1 Answer


If you have a lot of data, the easiest thing to do is sample from your matrix/data frame, then run the regression.

Given sampling theory, we know that the standard error of a proportion p is equal to sqrt((p(1-p))/n). So if you have 150 levels, and assuming the observations are evenly distributed across those levels, we would want to be able to detect proportions as small as .005 or so (a little under 1/150 ≈ 0.0067) in your data set. If we take a 10,000 row sample, the standard error of one of those factor levels is roughly:

sqrt((.005*.995)/10000) = 0.0007053368

That's really not much additional variance added to your regression estimates. Especially when you are doing exploratory analysis, sampling rows from your data, say a 12,000 row sample, should still give you plenty of data to estimate quantities while keeping estimation feasible. Reducing your rows by a factor of 10 should also help R fit the model without running out of memory. Win-win.
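
As a rough sketch of this sampling approach, assuming the data sits in a data frame `df` with the outcome in a column `y` (both names are placeholders):

```r
set.seed(42)

## Draw a 12,000-row sample, roughly a tenth of the 120,000 rows.
idx       <- sample(nrow(df), 12000)
sample_df <- df[idx, ]

## Drop factor levels that do not appear in the sample,
## so glm() does not build columns for empty levels.
sample_df <- droplevels(sample_df)

## Fit the model on the reduced data.
fit <- glm(y ~ ., data = sample_df, family = gaussian())
summary(fit)
```

One caveat: rare factor levels may not appear in the sample at all, which is why the argument above assumes observations are spread fairly evenly across levels.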