In the past I've used the lm function with both matrix-type and data.frame-type data. But I think this is the first time I've tried to use predict with a model fitted without a data.frame, and I can't figure out how to make it work.

I read some other questions (such as Getting Warning: " 'newdata' had 1 row but variables found have 32 rows" on predict.lm) and I'm pretty sure that my problem is related to the coefficient names I get after fitting the model. For some reason the coefficient names are the matrix name pasted together with the column name, and I haven't been able to find out how to fix that.

library(tidyverse)
library(MASS)

set.seed(1)
label <- sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE, prob = c(.6, .4))

x.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(medv) %>% as.matrix()
x.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(medv) %>% as.matrix()

fit_lm <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)
predict(object = fit_lm, newdata = x.test %>% as.data.frame()) %>% length() 
predict(object = fit_lm2, newdata = x.test %>% as.data.frame()) %>% length()
# the two calls return different numbers of predictions:
# the first returns as many values as there are rows in x.train, not x.test

Any help will be welcome.

Zheyuan Li
user5029763

1 Answer

I can't fix your tidyverse code because I don't work with this package. But I am able to explain why predict fails in the first case.

Let me just use the built-in dataset trees for a demonstration:

head(trees, 2)
#  Girth Height Volume
#1   8.3     70   10.3
#2   8.6     65   10.3

The normal way to use lm is

fit <- lm(Girth ~ ., trees)

The variable names (on the RHS of ~) are

attr(terms(fit), "term.labels")
#[1] "Height" "Volume"

You need to provide these variables in the newdata when using predict.

predict(fit, newdata = data.frame(Height = 1, Volume = 2))
#       1 
#11.16125 

Now if you fit a model using a matrix:

X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
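
This single matrix term also accounts for the coefficient names noticed in the question: each coefficient is named by pasting the term label onto the matrix's column names. A minimal sketch on the same trees data (coef_names is just an illustrative variable name):

```r
# Same matrix-style fit as above
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)

# Coefficient names are the term label "X" pasted onto each column name
coef_names <- names(coef(fit2))
print(coef_names)
#[1] "(Intercept)" "XHeight"     "XVolume"
```

So these names are a side effect of the single term X, not something that needs fixing on the coefficients themselves.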

The variable you need to provide in newdata for predict is now X, not Height or Volume. Note that since X is a matrix variable, you need to protect it with I() when putting it into a data frame.

newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame':  1 obs. of  1 variable:
# $ X: AsIs [1, 1:2] 1 2

predict(fit2, newdat)
#       1 
#11.16125 

It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
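
The same idea should carry over to a train/test split like the one in the question. The sketch below is only an illustration on trees, not a fix of the original tidyverse code: the split (first 20 rows for training) and the names train_idx, X.train, X.test, fit_mat and fit_df are assumptions made for the example. The test matrix is passed to predict under the name the training matrix had in the formula.

```r
# Illustrative split of the built-in trees data (first 20 rows for training)
train_idx <- seq_len(nrow(trees)) <= 20

X.train <- as.matrix(trees[train_idx, 2:3])
y.train <- trees[train_idx, 1]
X.test  <- as.matrix(trees[!train_idx, 2:3])

# Matrix-style fit: the only term label is "X.train"
fit_mat <- lm(y.train ~ X.train)

# newdata needs a matrix column named "X.train"; I() keeps the matrix as one column
pred_mat <- predict(fit_mat, newdata = data.frame(X.train = I(X.test)))

# Reference: the usual formula / data.frame fit on the same rows
fit_df  <- lm(Girth ~ ., data = trees, subset = train_idx)
pred_df <- predict(fit_df, newdata = trees[!train_idx, ])

# One prediction per test row, matching the data.frame-based fit
length(pred_mat) == nrow(X.test)
all.equal(unname(pred_mat), unname(pred_df))
```

By analogy, for the code in the question the call would presumably be predict(fit_lm, newdata = data.frame(x.train = I(x.test))), though I have not run it on the Boston data.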

Zheyuan Li