
I have a data frame of ~100 observations with 38 demographic variables, as well as pre- and post-test scores in six domains (var1:var6). I fitted a linear model with lm(): test.lm <- lm(var1_post ~ var1_pre + dem1 + dem2 + ... + dem38, data=test.df). The data frame test.df is a subset of a larger data frame, fulldata.df. In fulldata.df, 17 observations lack complete post data for var3_post and var4_post. However, test.df does not include those columns; it contains only var1_pre, the demographic variables, and var1_post. There are no missing values at all in test.df.

When I run summary(test.lm), it reports that 17 observations were deleted due to missingness. Presumably these are the 17 from fulldata.df?

Coefficients: (11 not defined because of singularities)
            Estimate    Std. Error  t value  Pr(>|t|)
intercept   -2.36E+01   9.84E+00    -2.403   0.02076   *
var1_pre     7.96E+00   1.48E+00     5.368   3.20E-06  ***
dem1         1.90E+00   1.16E+00     1.631   0.11037
dem2         1.43E-04   1.02E-01     0.001   0.99889
dem3        -7.52E-01   1.14E+00    -0.66    0.51277
dem4         7.65E-02   1.65E-01     0.463   0.6459
...
dem38       -2.50E+00   2.89E+00    -0.866   0.39135

Residual standard error: 3.93 on 42 degrees of freedom
  (17 observations deleted due to missingness)
Multiple R-squared:  0.6452,    Adjusted R-squared:  0.4003
F-statistic: 2.634 on 29 and 42 DF,  p-value: 0.002075

It doesn't make sense to me that lm() would pick up missingness from the larger data frame, but I cannot figure out where else the 17 "missing" observations could be coming from. Running which(!complete.cases(test.df)) returns integer(0).

Does anyone have any thoughts as to where those 17 observations could be coming from, or how I might go about identifying them?
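
For reference, here is a minimal sketch of the kind of check I have in mind, run on toy data rather than my real data. The names toy.df and toy.lm and the planted NA positions are made up for illustration. It uses na.action() to list the rows that lm() dropped, and model.frame() with na.action = na.pass to count NAs in the columns the model actually used. If the same check on test.lm shows NAs in a column that looks complete in test.df, that would suggest (this is only my assumption) that the model is not being built from the columns I think it is, for example because the model was fitted to a different object, or because a variable was not found in test.df and was looked up in the workspace instead.

```r
# Toy data standing in for test.df (hypothetical values, not the real data)
set.seed(1)
toy.df <- data.frame(
  var1_pre  = rnorm(20),
  dem1      = rnorm(20),
  var1_post = rnorm(20)
)
toy.df$var1_post[c(3, 7)] <- NA   # plant two missing outcomes

toy.lm <- lm(var1_post ~ var1_pre + dem1, data = toy.df)

# Rows that lm() dropped, as recorded on the fitted model;
# the names are the row names of the omitted observations
dropped <- na.action(toy.lm)
print(names(dropped))

# Rebuild the model frame without dropping anything,
# then count NAs in each column the model used
mf <- model.frame(formula(toy.lm), data = toy.df, na.action = na.pass)
na_counts <- colSums(is.na(mf))
print(na_counts)

# Number of observations actually used in the fit
print(nobs(toy.lm))
```

On the real model, the equivalent calls would be na.action(test.lm) and model.frame(formula(test.lm), data = test.df, na.action = na.pass).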

lrankin07