Assume I have data with a dependency y(t)
and parameters p1
, p2
and p3
which might influence the value y(t)
.
I create 3 linear equations which depend on the following combinations of the
parameters p1
and p2
- p3
has no impact on y(t)
, that means it follows a random assignment.
You can find a reproducible example in the end of the question.
The 3 equations are
p1 p2 Equation
1 1 5 + 3t
2 1 1 - t
2 2 3 + t
A plot of the 3 equations including random data looks like the following:
Now, if I call lm()
(For formulae see here) based on my random data, I get the following result.
lm(formula = y ~ .^2, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.14707 -0.22785 0.00157 0.23099 1.10528
Coefficients: (6 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.83711 0.17548 27.565 <2e-16 ***
t 2.97316 0.02909 102.220 <2e-16 ***
p12 -3.86697 0.21487 -17.997 <2e-16 ***
p22 2.30617 0.20508 11.245 <2e-16 ***
p23 NA NA NA NA
p32 0.16518 0.21213 0.779 0.4375
p33 0.23450 0.22594 1.038 0.3012
t:p12 -4.00574 0.03119 -128.435 <2e-16 ***
t:p22 2.01230 0.03147 63.947 <2e-16 ***
t:p23 NA NA NA NA
t:p32 0.01155 0.03020 0.383 0.7027
t:p33 0.02469 0.03265 0.756 0.4508
p12:p22 NA NA NA NA
p12:p23 NA NA NA NA
p12:p32 -0.10368 0.21629 -0.479 0.6325
p12:p33 -0.11728 0.21386 -0.548 0.5843
p22:p32 -0.20871 0.19633 -1.063 0.2896
p23:p32 NA NA NA NA
p22:p33 -0.44250 0.22322 -1.982 0.0495 *
p23:p33 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4112 on 136 degrees of freedom
Multiple R-squared: 0.9988, Adjusted R-squared: 0.9987
F-statistic: 8589 on 13 and 136 DF, p-value: < 2.2e-16
If I only want to condsider parameters with high significance, I would argue to ignore parameters close to zero. If I understand correctly, zero-parameters do not lead to "new lines". I then obtain the following simplified model (Values are rounded for readability):
Estimate
(Intercept) 5 ***
t 3 ***
p12 -4 ***
p22 2 ***
t:p12 -4 ***
t:p22 2 ***
I would then reconstruct the theoretical model as follows from the estimate above (only highly significant parameters!):
p1 p2 Equation Result
1 1 5+3t 5+3t
1 2 5+3t+p22+t:p22*t 7+5t
2 1 5+3t+p12+t:p12*t 1-t
2 2 5+3t+p22+t:p22*t+p12+t:p12*t 3+t
Now, 7 + 5t
is obviously wrong, but I am not sure about the reason.
I guess, lm
successively adds the paramters, thus the corresponding model
y ~ t:p2
is not contained in the model above?
This question and references therein might be related, but I didn't look at the lm
result - so there is nothing about that.
Reproducible example:
r <- generate_3lines(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1))
t_m <- r$t_m; y_m <- r$y_m; y_t <- r$y_t; rm(r)
mydata <- generate_randomdata(t_m, y_m, y_t)
# What the raw data looks like:
plot(t_m[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data",
xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(t_m[[2]], y_t[[2]], col = "black", lty = 3)
lines(t_m[[3]], y_t[[3]], col = "black", lty = 3)
points(x = mydata$t, y = mydata$y)
fit <- lm(y ~ .^2, data = mydata) # Not all levels / variables are linearly
print(summary(fit))
and the functions
generate_3lines <- function(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1)) {
t <- seq(0,10, length.out = 1000) # large sample of x values
t_m <- list()
y_m <- list()
y_t <- list()
for (i in 1:3) {
set.seed(33*i)
t_m[[i]] <- sort(sample(t, 50, replace = F))
set.seed(33*i)
noise <- rnorm(10, 0, sigma)
y_m[[i]] <- slopes[i]*t_m[[i]] + offsets[i] + noise
y_t[[i]] <- slopes[i]*t_m[[i]] + offsets[i]
}
return(list(t_m = t_m, y_m = y_m, y_t = y_t))
}
generate_randomdata <- function(t_m, y_m, y_t) {
# Final data set
df1 <- data.frame(t = t_m[[1]], y = y_m[[1]], p1 = rep(1), p2 = rep(1),
p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = T))
df2 <- data.frame(t = t_m[[2]], y = y_m[[2]], p1 = rep(2), p2 = rep(2),
p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = T))
df3 <- data.frame(t = t_m[[3]], y = y_m[[3]], p1 = rep(2), p2 = rep(3),
p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata$p1 <- factor(mydata$p1)
mydata$p2 <- factor(mydata$p2)
mydata$p3 <- factor(mydata$p3)
mydata <- mydata[sample(nrow(mydata)), ]
return(mydata)
}
Edit after input from @MrFlick: The question is now also on Cross Validated
Comment: It seems, the fit is not really automated in ggplot
, see here