R linear model (lm) predict function with one single array

Question

I have an lm model in R that I have trained and serialized. Inside a function, where I pass as input the model and a feature vector (one single array), I have:

CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    #R-code goes here.
    mdl <- unserialize(model)
    # class(feat_vec) outputs "array"
    y_hat <- predict.lm(mdl, newdata = as.data.frame.list(feat_vec))
    return (y_hat)
$$ LANGUAGE 'plr';

This returns the wrong y_hat!! I know this because this other solution works (the inputs to this function are still the model (in a bytearray) and one feat_vec (array)):

CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    #R-code goes here.
    mdl <- unserialize(model)
    coef = mdl$coefficients
    y_hat = coef[1] + as.numeric(coef[-1]%*%feat_vec)
    return (y_hat)
$$ LANGUAGE 'plr';

What am I doing wrong?? It is the same unserialized model, the first option should give me the right answer as well...

Is this R code? It looks like half python; colons don't work that way in R, nor does `return` or `+`. — alistaire, Sep 16 '16 at 04:12
Yes, it is R + pseudocode; you can ignore the function declaration. Actually, this is inside a PL/R function in Postgres, but I didn't want to put the focus on Postgres — strv7, Sep 16 '16 at 04:33
I have made some edits to my question; hopefully it is clear now. The first option returns a wrong number, whilst the second returns the correct prediction! I get no errors, however — strv7, Sep 16 '16 at 04:38
Better, but still not answerable without [a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — alistaire, Sep 16 '16 at 04:41

score 2 · Accepted Answer · edited May 23 '17 at 11:45

The problem seems to be the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names. When you call predict, however, newdata must have column names that match the covariate names in your model formula. If they do not match, predict looks the covariates up in the formula's environment instead, so it can pick up the training data rather than your new values. You should get a warning message when you call predict.

## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)

## linear model
model <- lm(y ~ x1 + x2)

## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
#  X0.4 X0.6
#1  0.4  0.6

## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows
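
To make the warning concrete, here is a small self-contained check (a sketch that repeats the example data above; the names bad_newdat and y_bad are introduced here only for illustration). It suggests that with mismatched column names predict does not use the new values at all and simply reproduces the fitted values for the 20 training rows:

```r
## same example data as above, repeated so the snippet runs on its own
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

## one-row data frame whose columns are NOT named x1 / x2
bad_newdat <- as.data.frame(as.list(c(0.4, 0.6)))

## x1 / x2 are not found in bad_newdat, so they are taken from the
## environment where the model was fitted (the training vectors)
y_bad <- suppressWarnings(predict(model, newdata = bad_newdat))

length(y_bad)                                            # 20, not 1
isTRUE(all.equal(unname(y_bad), unname(fitted(model))))  # TRUE
```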

What you need is

newdat <- as.data.frame.list(feat_vec,
                             col.names = attr(model$terms, "term.labels"))
#   x1  x2
#1 0.4 0.6

y_hat <- predict.lm(model, newdata = newdat)
#        1 
#0.5192413

This is the same as what you can compute manually:

coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413
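
If warnings are not shown to the caller, one option is to wrap the steps above in a small helper that names the columns itself and turns any warning from predict into an error. This is only a sketch: predict_one is a hypothetical name, and it assumes that feat_vec lists the predictors in the same order as the model formula and that all predictors are plain numeric variables. It uses all.vars(delete.response(terms(model))) rather than term.labels so that the names are variable names even when the formula has interaction terms.

```r
## hypothetical helper: predict for a single unnamed feature vector
predict_one <- function(model, feat_vec) {
  ## predictor names in formula order (assumed to match feat_vec's order)
  vars <- all.vars(delete.response(terms(model)))
  if (length(feat_vec) != length(vars)) {
    stop("expected ", length(vars), " features, got ", length(feat_vec))
  }
  ## one-row data frame with the column names predict() will look for
  newdat <- as.data.frame(setNames(as.list(as.numeric(feat_vec)), vars))
  ## promote any warning to an error so a mismatch cannot go unnoticed
  y_hat <- tryCatch(
    predict(model, newdata = newdat),
    warning = function(w) stop("predict() warned: ", conditionMessage(w))
  )
  unname(y_hat)
}

## usage with the same example data as above
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

feat_vec <- c(0.4, 0.6)
y_helper <- predict_one(model, feat_vec)
y_manual <- unname(coef(model)[1] + sum(coef(model)[-1] * feat_vec))
isTRUE(all.equal(y_helper, y_manual))  # TRUE
```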

answered Sep 16 '16 at 04:54

Zheyuan Li


I don't get to see warning messages when calling R from Postgres... But something is definitely wrong – strv7 Sep 16 '16 at 04:56
Thank you for your answer, I really appreciate it. It is still not working for me though: y_hat always returns the same result, while the "manual" computation returns correct predictions. I don't understand why :/ Why do I need to include the col.names? Is that really important? – strv7 Sep 16 '16 at 06:39
This solved my issue when working with randomForest... thanks! I still get the weird behavior with lm but happy I got it to work with another regression model and exactly the same code! – strv7 Sep 16 '16 at 07:35
