I have a data frame of ~100 observations of 38 demographic variables, as well as pre- and post-test scores in six domains (var1:var6). I fitted a linear model using lm():
test.lm <- lm(var1_post ~ var1_pre + dem1 + dem2 + ... + dem38, data=test.df)
The data frame test.df is a subset of a larger data frame, fulldata.df. In fulldata.df, 17 observations do not have complete post data for var3_post and var4_post. However, test.df does not include those columns; it contains only var1_pre, all the demographic variables, and var1_post. There are no missing values at all in test.df.
When I run summary(test.lm), it tells me that 17 observations have been removed for missingness. Presumably these are the 17 from fulldata.df?
Coefficients: (11 not defined because of singularities)
           Estimate   Std. Error  t value  Pr(>|t|)
intercept  -2.36E+01  9.84E+00    -2.403   0.02076   *
var1_pre    7.96E+00  1.48E+00     5.368   3.20E-06  ***
dem1        1.90E+00  1.16E+00     1.631   0.11037
dem2        1.43E-04  1.02E-01     0.001   0.99889
dem3       -7.52E-01  1.14E+00    -0.66    0.51277
dem4        7.65E-02  1.65E-01     0.463   0.6459
...
dem38      -2.50E+00  2.89E+00    -0.866   0.39135
Residual standard error: 3.93 on 42 degrees of freedom
(17 observations deleted due to missingness)
Multiple R-squared: 0.6452, Adjusted R-squared: 0.4003
F-statistic: 2.634 on 29 and 42 DF, p-value: 0.002075
It doesn't make sense to me at all that lm() would recognize missingness from the larger data frame, but I cannot figure out where else the "missing" 17 observations would be coming from. Even running which(!complete.cases(test.df)) returns integer(0).
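For reference, here is a minimal sketch of the kind of check that could identify the dropped rows directly from the fitted model instead of from the data frame. It uses a toy data frame (toy.df, toy.lm, and the planted NAs are all hypothetical stand-ins, not my real data) and relies on na.action(), nobs(), and model.frame() from base R's stats package. I am assuming the same calls would apply to test.lm.

```r
# Toy data standing in for test.df (hypothetical; not the real data).
set.seed(1)
toy.df <- data.frame(
  y  = rnorm(10),
  x1 = rnorm(10),
  x2 = rnorm(10)
)
toy.df$x2[c(3, 7)] <- NA  # plant two missing values in one predictor

toy.lm <- lm(y ~ x1 + x2, data = toy.df)

# Number of rows actually used in the fit, versus rows supplied.
n_used  <- nobs(toy.lm)
n_given <- nrow(toy.df)

# Row indices that lm() dropped, as recorded on the fitted object.
dropped_rows <- as.integer(na.action(toy.lm))

# Rebuild the model frame without dropping anything, then count NAs per
# column to see which variable(s) triggered the deletion.
mf <- model.frame(y ~ x1 + x2, data = toy.df, na.action = na.pass)
na_per_column <- colSums(is.na(mf))
```

In the toy case, dropped_rows holds the two planted row numbers and na_per_column points at x2. If the same check on test.lm showed NAs in a column that test.df reports as complete, that would suggest lm() is picking the variable up from somewhere other than test.df.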
Does anyone have any thoughts as to where those 17 observations could be coming from, or how I might go about identifying them?