
Ok, so here is the code that demonstrates the problem I am referring to:

library(car)  # for hccm() and linearHypothesis()

set.seed(1)   # make the example reproducible

# Toy data: x1 is tiny, x2 is huge, x3 is on a moderate scale
x1 <- c(0.001, 0.002, 0.003, 0.0003)
x2 <- c(15000893, 23034340, 3034300, 232332242)
x3 <- c(1, 3, 5, 6)
y  <- rnorm(4)

model  <- lm(y ~ x1 + x2)
model2 <- lm(y ~ x1 + x3)

# Replace the usual standard errors with HC0 (White) robust ones
type  <- "hc0"
V     <- hccm(model, type = type)
sumry <- summary(model)
table <- coef(sumry)
table[, 2] <- sqrt(diag(V))
table[, 3] <- table[, 1] / table[, 2]
table[, 4] <- 2 * pt(abs(table[, 3]), df.residual(model), lower.tail = FALSE)
sumry$coefficients <- table

# Joint test that all slope coefficients are zero
p   <- nrow(table)
hyp <- cbind(0, diag(p - 1))
linearHypothesis(model, hyp, white.adjust = type)

Note that this is not caused by perfect multicollinearity. As you can see, I deliberately set x2 to be very large and x1 to be very small. When the scales differ this much, I cannot run a linearHypothesis test that all slope coefficients of model <- lm(y ~ x1 + x2) are 0, i.e. linearHypothesis(model, hyp, white.adjust=type). R throws the following error:

> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) : 
  system is computationally singular: reciprocal condition number = 2.31795e-23

However, when I use model2 <- lm(y ~ x1 + x3) instead, where x3 is on a scale comparable to x1, the linearHypothesis test succeeds:

> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test

Hypothesis:
x1 = 0
x3 = 0

Model 1: restricted model
Model 2: y ~ x1 + x3

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1      3                 
2      1  2 11.596 0.2033

I am aware that this is probably because R cannot invert a matrix whose reciprocal condition number falls below a certain threshold, in this case 2.31795e-23. However, is there a way to circumvent that? Is this a limitation of R itself or of the underlying linear algebra (LAPACK) routines? What is good practice here? The only method I can think of is to rescale the variables so that they are on comparable scales, but I am also concerned about how much information I would lose by dividing everything by its standard deviation.
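
As an illustration of that workaround, here is a minimal sketch using the toy data above; dividing by sd(x2) and the names x2_s / model_s are my own choices, not part of the original code:

x2_s    <- x2 / sd(x2)        # bring x2 onto a scale comparable to x1
model_s <- lm(y ~ x1 + x2_s)  # refit with the rescaled regressor
linearHypothesis(model_s, hyp, white.adjust = type)  # same joint test as before

With only four observations the test itself is not very informative, but the solve() call should no longer fail, because the entries of the covariance matrix are now of comparable magnitude.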

In fact, I have 200 variables that are percentages and 10 variables (including the dependent variables) that are large (potentially on the order of 10^6). Scaling them one by one would be tedious.
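
A minimal sketch of scaling many columns at once, along the lines of the lapply(df, scale) suggestion in the comments below; df here is a hypothetical data frame holding only the regressors, not an object from the question:

sds       <- sapply(df, sd)                    # keep the original SDs for back-transformation
df_scaled <- as.data.frame(lapply(df, scale))  # centre and scale every column in one step
model_all <- lm(y ~ ., data = df_scaled)       # y is assumed to be a separate vector
# coef(model_all)[v] / sds[v] gives the coefficient of variable v on its original scale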

Jinhua Wang
  • Why is scaling "troublesome"? Scaling variables is common and easy. Some people do it always. They don't even need to be the *same* scale, just not so vastly different. Use `x2_scaled = scale(x2)` instead of `x2`. – Gregor Thomas Sep 18 '19 at 17:51
  • The error message states _**computationally** singular_, which may indicate a precision issue of R or of your machine. – jay.sf Sep 18 '19 at 17:54
  • @Gregor I thought it is troublesome because I have 200 variables that are small, and 10 variables that are large. – Jinhua Wang Sep 18 '19 at 18:03
  • @Gregor Besides, I am also a little concerned about the information loss from dividing everything by standard deviations. – Jinhua Wang Sep 18 '19 at 18:04
  • @jay.sf Thanks. I think it is a precision issue indeed. – Jinhua Wang Sep 18 '19 at 18:08
  • So scale all of them. But no information needs to be lost - just save the SDs before you scale. You can transform the coefficients back to the original scale if you want. You can scale all the variables in a data frame with `df[] = lapply(df, scale)`. You can also scale just some of them with `vars_to_scale = c("var1", "var8", ...); df[vars_to_scale] = lapply(df[vars_to_scale], scale)` – Gregor Thomas Sep 18 '19 at 18:20
  • @Gregor I don't think it is that easy to recover the coefficients - because scaling usually involves demeaning and dividing by standard deviation. See this post: https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning – Jinhua Wang Sep 18 '19 at 19:23
  • For nonlinear methods, e.g., SVM, you are losing useful information. [For ordinary linear regression, recovering the coefficients is not difficult](https://stackoverflow.com/a/14512703/903061). The centering (subtracting the mean) is accounted for in the intercept, and the scaling (dividing by the SD) can be compensated for too. Try it! `m1 = lm(mpg ~ wt + drat, data = mtcars); m2 = lm(mpg ~ wt + scale(drat), data = mtcars)`. You'll see that multiplying the `drat` coefficient from `m1` by `sd(mtcars$drat)` gets you the `scale(drat)` coef in `m2`. The `wt` coefficient is unchanged. (See the sketch below the comments.) – Gregor Thomas Sep 18 '19 at 19:38
  • @Gregor Looks like scaling could affect linearHypothesis test though: https://stats.stackexchange.com/questions/340544/linear-hypothesis-tests-with-unstandardized-and-standardized-variables-why-are – Jinhua Wang Sep 18 '19 at 19:59
  • Not being able to fit your model has a larger effect. – Gregor Thomas Sep 18 '19 at 20:12
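
To make the back-transformation described in the comments concrete, a minimal sketch with the mtcars example mentioned by Gregor Thomas (base R only):

m1 <- lm(mpg ~ wt + drat, data = mtcars)
m2 <- lm(mpg ~ wt + scale(drat), data = mtcars)
coef(m1)["drat"] * sd(mtcars$drat)  # equals coef(m2)["scale(drat)"]; the wt coefficient is unchanged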
