how to regression with a categorical variable that has a lar

Question

I have a dataframe (1,000,000 observations) with two variables, y (wages) and names (122,000 distinct names), and I want to explain "y" with "names".

I tried with R and Python

R

mod<-lm(y~names,data=pop1)

R: message error: cannot allocate vector of size 111.0 Gb

python

fit = ols('y ~ C(names)', data=pop1).fit()

MemoryError

Because both error messages are about large numbers and memory allocation, what could the cause be? — Bogdan Doicin, Sep 25 '19 at 10:27
Your memory needs are too damn high. There are packages that work around that, like biglm in R. — user2974951, Sep 25 '19 at 10:27
Forget it. Dummy encoding a categorical variable with 122k levels needs huge amounts of memory. But that's not your real problem. Even if you had that kind of RAM, the resulting model would be useless. You need a consultant (a statistician or possibly a machine learning specialist). — Roland, Sep 25 '19 at 10:33
The model you're specifying implicitly has ~122k coefficients (i.e. an intercept and one for each "name"). You probably need to rethink the model; at the very least you probably want some regularisation in there. If you don't, then just run a separate regression for each "name", as it's almost the same as your model (you're currently fixing the intercept across all names). — Sam Mason, Sep 25 '19 at 10:38
Thanks, guys. I know that the result of this regression will be useless as a model. In a research program about the information hidden in surnames, I'm trying to estimate the share of that information as R^2; even if the value is 1% or less, it's important to me. — Tarek Janati Idrissi, Sep 30 '19 at 08:25

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

Your problem is (as the comments are roaring) that you lack the memory to perform the calculation. Another very important point is why you want to run this regression in the first place.
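To put the memory point in numbers, here is a back-of-the-envelope sketch. It assumes a dense design matrix of 8-byte doubles with one column per coefficient (intercept plus one dummy per non-reference name); the exact object whose allocation failed in R may be a different intermediate, so this does not try to reproduce the 111.0 Gb figure.

```python
# Rough size of a dense dummy-coded design matrix (assumption: float64, no sparsity).
n_obs = 1_000_000        # observations, from the question
n_columns = 122_000      # intercept + 121,999 dummy columns
bytes_per_double = 8

dense_bytes = n_obs * n_columns * bytes_per_double
dense_gib = dense_bytes / 2**30

print(f"Dense design matrix: about {dense_gib:,.0f} GiB")
```

That is several hundred GiB before any decomposition is attempted, which is why a dense dummy encoding is not feasible here.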

With an OLS model that includes only a single factor variable (dummy-coded) with multiple levels, what you are actually estimating is the group means, in this case the mean of y for each name. Most least-squares implementations use a QR decomposition and build a contrast-coded design matrix. With the default treatment contrasts, the intercept is the mean of the first (reference) group, and each remaining coefficient is the difference between that group's mean and the intercept. This is the case in R's lm function. So we can still get the coefficients out, calculate R-squared, etc. without fitting the model, if we really want to. For illustration, here is an example using the mtcars dataset:

data(mtcars)
fit <- lm(mpg ~ factor(cyl), data = mtcars)
# group means of mpg for each level of cyl
coefs <- tapply(mtcars$mpg, mtcars$cyl, mean)
# the reference level (cyl = 4) becomes the intercept
intercept <- coefs[1]
# the other coefficients are differences from the reference mean
beta <- c(intercept, coefs[-1] - intercept)
names(beta) <- c("(Intercept)", paste0("cyl", levels(factor(mtcars$cyl))[-1]))
beta
#output
(Intercept)        cyl6        cyl8 
  26.663636   -6.920779  -11.563636 
coef(fit)
#output
 (Intercept) factor(cyl)6 factor(cyl)8 
   26.663636    -6.920779   -11.563636 
all.equal(coef(fit), beta, check.attributes = FALSE)
#output
[1] TRUE

R-squared is calculated likewise.

But again, what do you really want to estimate? For this purpose a full linear regression is overkill.

Edit: R-squared

Note that R-squared can be calculated simply from the relation Rsquared = 1 - SSE / SST = SSF / SST. With a single factor, SSF is proportional to var(fitted) and SST is proportional to var(response), both with the same factor n - 1, so the factor cancels in the ratio and R-squared can be obtained as

fitted <- ave(mtcars$mpg, mtcars$cyl, FUN = mean)
ssf <- var(fitted)
sst <- var(mtcars$mpg)
r2 <- ssf / sst
all.equal(r2, summary(fit)$r.squared)
[1] TRUE
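Since the question also tried Python, the same idea can be sketched there without building any design matrix. This is a minimal illustration using only the standard library; the function name `one_factor_r_squared` and the toy data are hypothetical, not from the question. It accumulates one sum and one count per name, so memory grows with the number of distinct names rather than with observations times names.

```python
from collections import defaultdict

def one_factor_r_squared(groups, values):
    """R-squared of an OLS fit of `values` on a single categorical `groups`,
    computed as SSF / SST from group means (no design matrix needed)."""
    n = len(values)
    grand_mean = sum(values) / n

    # Per-group running sum and count: one entry per distinct level.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        sums[g] += v
        counts[g] += 1

    # Between-group sum of squares: fitted values are the group means.
    ssf = sum(counts[g] * (sums[g] / counts[g] - grand_mean) ** 2 for g in counts)
    # Total sum of squares around the grand mean.
    sst = sum((v - grand_mean) ** 2 for v in values)
    return ssf / sst

# Toy data (hypothetical): three names, five wage observations.
names = ["a", "a", "b", "b", "c"]
wages = [1.0, 3.0, 2.0, 6.0, 8.0]
r2 = one_factor_r_squared(names, wages)
print(r2)  # 24/34, about 0.706
```

With a pandas dataframe, something like `pop1.groupby("names")["y"].transform("mean")` should give the fitted values in the same spirit as `ave` in R, assuming the column names from the question.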

thank you for the answer, the important thing I need is the value of r2. — Tarek Janati Idrissi, Sep 30 '19 at 09:42
This can be calculated using `explained variance / total variance` (equivalently `1 - residual variance / total variance`), which you could calculate in a similar fashion. Something like... (back of the napkin) `sst <- var(mtcars$mpg); ssf <- var(ave(mtcars$mpg, mtcars$cyl, FUN = mean)); r2 <- ssf/sst` — Oliver, Sep 30 '19 at 09:46
Example added to the answer. Remember to give helpful answers an upvote, and answers that fully answer the question an "answered" tick, to show others that this helped solve the problem you had. :-) — Oliver, Sep 30 '19 at 09:53
