1

Assume I have data with a dependency y(t) and parameters p1, p2 and p3 which might influence the value y(t). I create 3 linear equations which depend on the following combinations of the parameters p1 and p2 - p3 has no impact on y(t), that means it follows a random assignment. You can find a reproducible example in the end of the question.

The 3 equations are

p1 p2   Equation   
 1  1   5 + 3t
 2  1   1 - t
 2  2   3 + t

A plot of the 3 equations including random data looks like the following:

enter image description here

Now, if I call lm() (For formulae see here) based on my random data, I get the following result.

lm(formula = y ~ .^2, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14707 -0.22785  0.00157  0.23099  1.10528 

Coefficients: (6 not defined because of singularities)
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  4.83711    0.17548   27.565   <2e-16 ***
t            2.97316    0.02909  102.220   <2e-16 ***
p12         -3.86697    0.21487  -17.997   <2e-16 ***
p22          2.30617    0.20508   11.245   <2e-16 ***
p23               NA         NA       NA       NA    
p32          0.16518    0.21213    0.779   0.4375    
p33          0.23450    0.22594    1.038   0.3012    
t:p12       -4.00574    0.03119 -128.435   <2e-16 ***
t:p22        2.01230    0.03147   63.947   <2e-16 ***
t:p23             NA         NA       NA       NA    
t:p32        0.01155    0.03020    0.383   0.7027    
t:p33        0.02469    0.03265    0.756   0.4508    
p12:p22           NA         NA       NA       NA    
p12:p23           NA         NA       NA       NA    
p12:p32     -0.10368    0.21629   -0.479   0.6325    
p12:p33     -0.11728    0.21386   -0.548   0.5843    
p22:p32     -0.20871    0.19633   -1.063   0.2896    
p23:p32           NA         NA       NA       NA    
p22:p33     -0.44250    0.22322   -1.982   0.0495 *  
p23:p33           NA         NA       NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4112 on 136 degrees of freedom
Multiple R-squared:  0.9988,    Adjusted R-squared:  0.9987 
F-statistic:  8589 on 13 and 136 DF,  p-value: < 2.2e-16

If I only want to condsider parameters with high significance, I would argue to ignore parameters close to zero. If I understand correctly, zero-parameters do not lead to "new lines". I then obtain the following simplified model (Values are rounded for readability):

            Estimate
(Intercept)        5 ***
t                  3 ***
p12               -4 ***
p22                2 ***
t:p12             -4 ***
t:p22              2 ***

I would then reconstruct the theoretical model as follows from the estimate above (only highly significant parameters!):

p1 p2   Equation                       Result 
 1  1   5+3t                           5+3t   
 1  2   5+3t+p22+t:p22*t               7+5t   
 2  1   5+3t+p12+t:p12*t               1-t   
 2  2   5+3t+p22+t:p22*t+p12+t:p12*t   3+t    

Now, 7 + 5t is obviously wrong, but I am not sure about the reason. I guess, lm successively adds the paramters, thus the corresponding model y ~ t:p2 is not contained in the model above?

This question and references therein might be related, but I didn't look at the lm result - so there is nothing about that.

Reproducible example:

r <- generate_3lines(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1))
t_m <- r$t_m; y_m <- r$y_m; y_t <- r$y_t; rm(r)

mydata <- generate_randomdata(t_m, y_m, y_t)

# What the raw data looks like:
plot(t_m[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data",
     xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(t_m[[2]], y_t[[2]], col = "black", lty = 3)
lines(t_m[[3]], y_t[[3]], col = "black", lty = 3)
points(x = mydata$t, y = mydata$y)

fit <- lm(y ~ .^2, data = mydata) # Not all levels / variables are linearly
print(summary(fit))

and the functions

generate_3lines <- function(sigma = 0.5, slopes = c(3, 1, -1), offsets = c(5, 3, 1)) {
  t <- seq(0,10, length.out = 1000) # large sample of x values
  t_m <- list()
  y_m <- list()
  y_t <- list()

  for (i in 1:3) {
    set.seed(33*i)
    t_m[[i]] <- sort(sample(t, 50, replace = F))
    set.seed(33*i)
    noise <- rnorm(10, 0, sigma)
    y_m[[i]] <- slopes[i]*t_m[[i]] + offsets[i] + noise
    y_t[[i]] <- slopes[i]*t_m[[i]] + offsets[i]
  }
  return(list(t_m = t_m, y_m = y_m, y_t = y_t))
}

generate_randomdata <- function(t_m, y_m, y_t) {
  # Final data set
  df1 <- data.frame(t = t_m[[1]], y = y_m[[1]], p1 = rep(1), p2 = rep(1),
                    p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace  =  T))
  df2 <- data.frame(t = t_m[[2]], y = y_m[[2]], p1 = rep(2), p2 = rep(2),
                    p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace  =  T))
  df3 <- data.frame(t = t_m[[3]], y = y_m[[3]], p1 = rep(2), p2 = rep(3),
                    p3 = sample(c(1, 2, 3), length(t_m[[1]]), replace = T))
  mydata <- rbind(df1, df2, df3)
  mydata$p1 <- factor(mydata$p1)
  mydata$p2 <- factor(mydata$p2)
  mydata$p3 <- factor(mydata$p3)
  mydata <- mydata[sample(nrow(mydata)), ]
  return(mydata)
}

Edit after input from @MrFlick: The question is now also on Cross Validated

Comment: It seems, the fit is not really automated in ggplot, see here

Christoph
  • 6,841
  • 4
  • 37
  • 89
  • 2
    Is this really a question about programming? Seems maybe that you just need help interpreting a statistical model. [stats.se] is better for statistical questions. – MrFlick Oct 03 '18 at 15:30
  • @MrFlick I can't Imaging that someone without r knowledge knows how to interprete the output of r. – Christoph Oct 03 '18 at 16:58
  • 2
    Despite rumors to the contrary, most statisticians do not use paper and pen to do all their work. Many use computers and many even use R. Most should be able to interpret the results of the `lm()` call. But really you can't just pick and choose significant coefficients from a model. – MrFlick Oct 03 '18 at 17:03
  • @MrFlick I would argue that I do not pick parameters but that i set parameters (which are close to 0) to 0, thus they do not lead to different y values. – Christoph Oct 03 '18 at 17:47
  • @MrFlick: Thanks for pointing me to Cross Validated. I posted the question [there](https://stats.stackexchange.com/q/370052/209028). – Christoph Oct 04 '18 at 07:13

1 Answers1

0

In brief, everything is ok with the model and the result from lm. As explained in this answer on cross-validated, 7+5t is just an extrapolation to a range without data. Furthermore, the synthetic data suffers from collinearity.

Christoph
  • 6,841
  • 4
  • 37
  • 89