
I have a dataset that includes some nested variables. For example, I have the following variables: the speed of a car (speed), whether another car is following it (other_car) and, if there is another car, the distance between the two cars (distance). Dummy dataset:

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)

I would like to include other_car and distance in a model as nested variables, i.e. distance should only be considered when another car is present. Following the approach described here (https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model), I tried the following:

dft <- data.frame(speed,other_car,distance)
dft$other_car <- factor(dft$other_car)

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

This gives the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Any ideas?

Anna

1 Answer

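This happens because distance is NA whenever other_car==0. lm() drops rows containing NA, so only the other_car==1 rows remain and the factor other_car is left with a single level, hence the contrasts error. See: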

dft$distance[dft$other_car==0]
[1] NA NA NA NA NA NA NA
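
The effect of dropping the incomplete rows can be checked directly. This is a minimal sketch, assuming the default NA handling of lm() (na.omit); `kept` is just an illustrative name:

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)
dft <- data.frame(speed, other_car = factor(other_car), distance)

# Keep only the complete rows, which is what the default NA handling amounts to
kept <- dft[complete.cases(dft), ]

nrow(kept)                          # 7 rows survive
table(kept$other_car)               # all of them have other_car == 1
nlevels(droplevels(kept$other_car)) # 1 level left, so contrasts cannot be built
```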

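You could replace the NAs with a constant distance where other_car==0, so that no rows are dropped. The model then keeps both levels of other_car, and because distance is constant within the other_car==0 subset, its slope there cannot be estimated (the coefficient is reported as NA):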

dft$distance[dft$other_car == 0] <- 0

dft$other_car <- factor(dft$other_car)

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.015  -8.500  -3.876   8.894  21.000 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39.0000     5.0405   7.737 8.96e-06 ***
other_car1            4.6480    13.0670   0.356    0.729    
other_car0:distance       NA         NA      NA       NA    
other_car1:distance   0.3157     0.6133   0.515    0.617    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.34 on 11 degrees of freedom
Multiple R-squared:  0.1758,    Adjusted R-squared:  0.026 
F-statistic: 1.174 on 2 and 11 DF,  p-value: 0.3452
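
One way to read these coefficients, as a sketch rather than part of the original answer (`dft0`, `fit_nested`, `fit_sub` and `b` are illustrative names): the intercept is the mean speed without a following car, and the other_car1 and other_car1:distance terms reproduce the intercept shift and slope of a regression fitted only to the rows with a following car.

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft0 <- data.frame(speed, other_car, distance)
dft0$distance[dft0$other_car == 0] <- 0
dft0$other_car <- factor(dft0$other_car)

# Nested model on all rows, and a plain regression on the other_car == 1 rows only
fit_nested <- lm(speed ~ other_car + other_car:distance, data = dft0)
fit_sub <- lm(speed ~ distance, data = subset(dft0, other_car == 1))
b <- coef(fit_nested)

# Intercept equals the mean speed when no car is following
all.equal(unname(b["(Intercept)"]), mean(dft0$speed[dft0$other_car == 0]))
# Intercept + other_car1 equals the intercept of the subset regression
all.equal(unname(b["(Intercept)"] + b["other_car1"]),
          unname(coef(fit_sub)["(Intercept)"]))
# The interaction slope equals the slope of the subset regression
all.equal(unname(b["other_car1:distance"]), unname(coef(fit_sub)["distance"]))
```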

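Another workaround could be to convert the factor to numeric, but this isn't the same model, because the rows with NA distance are still dropped (see the 7 deleted observations in the output):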

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)

dft$other_car <- as.numeric(factor(dft$other_car))

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
        2         6         7         8         9        11        13 
  0.03776   3.72205  19.77341 -15.38369 -16.01511  10.61782  -2.75227 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)         43.6480    12.9010   3.383   0.0196 *
other_car                NA         NA      NA       NA  
other_car:distance   0.1579     0.3281   0.481   0.6508  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.27 on 5 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.04424,   Adjusted R-squared:  -0.1469 
F-statistic: 0.2314 on 1 and 5 DF,  p-value: 0.6508

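The positive coefficient suggests that speed increases with the distance to the other car (or, put the other way round, drivers tend to slow down when the other car is too close), although with this dummy data the effect is not significant (p = 0.65).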
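
As a hedged illustration of why the numeric coding is a different model (the names `codes`, `d_num`, `fit_num` and `fit_sub` are only illustrative): as.numeric(factor(...)) codes the levels as 1 and 2 rather than 0 and 1, and after the NA rows are dropped other_car is constant at 2. The interaction column is therefore simply 2 * distance, so its coefficient is half the plain distance slope, and other_car itself is aliased with the intercept.

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

# Factor levels "0" and "1" become the integer codes 1 and 2
codes <- as.numeric(factor(other_car))
sort(unique(codes))  # 1 2

d_num <- data.frame(speed, other_car = codes, distance)
fit_num <- lm(speed ~ other_car + other_car:distance, data = d_num)

# Plain regression of speed on distance for the rows that have a distance
fit_sub <- lm(speed ~ distance, data = d_num[!is.na(d_num$distance), ])

# Twice the interaction coefficient equals the plain distance slope
all.equal(2 * unname(coef(fit_num)["other_car:distance"]),
          unname(coef(fit_sub)["distance"]))
```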

Waldi
  • Thank you for your answer. If I understand correctly, in the second model we can see that speed increases with distance, but we can't compare speeds based on the variable other_car (i.e. include not only the interaction other_car:distance, but also other_car as a factor variable)? Would it be accurate to follow the first approach but assign 0 to all NAs? – Anna Apr 28 '21 at 17:19
  • Assigning the same constant value to all `NA`s indeed works. My intuition would have been to use a high value instead of 0, because that is like having no car behind, but the choice of constant makes no difference to the fit. – Waldi Apr 28 '21 at 19:29
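
The claim that the constant does not matter can be checked with a small sketch (the helper `fit_with_fill` is a hypothetical name, and 1000 is an arbitrary "high" value chosen for illustration):

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

# Hypothetical helper: refit the nested model after replacing the NAs with `fill`
fit_with_fill <- function(fill) {
  d <- data.frame(speed, other_car, distance)
  d$distance[d$other_car == 0] <- fill
  d$other_car <- factor(d$other_car)
  lm(speed ~ other_car + other_car:distance, data = d)
}

fit_zero <- fit_with_fill(0)
fit_high <- fit_with_fill(1000)

# Both fills give the same fitted values
all.equal(fitted(fit_zero), fitted(fit_high))
```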