
I have an lm model in R that I have trained and serialized. Inside a function, where I pass as input the model and a feature vector (one single array), I have:

CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    #R-code goes here.
    mdl <- unserialize(model)
    # class(feat_vec) outputs "array"
    y_hat <- predict.lm(mdl, newdata = as.data.frame.list(feat_vec))
    return (y_hat)
$$ LANGUAGE 'plr';

This returns the wrong y_hat!! I know this because this other solution works (the inputs to this function are still the model (in a bytearray) and one feat_vec (array)):

CREATE OR REPLACE FUNCTION lm_predict(
    feat_vec float[],
    model bytea
)
RETURNS float
AS
$$
    #R-code goes here.
    mdl <- unserialize(model)
    coef = mdl$coefficients
    y_hat = coef[1] + as.numeric(coef[-1]%*%feat_vec)
    return (y_hat)
$$ LANGUAGE 'plr';

What am I doing wrong?? It is the same unserialized model, the first option should give me the right answer as well...

Zheyuan Li
strv7
  • Is this R code? It looks like half python; colons don't work that way in R, nor does `return` or `+`. – alistaire Sep 16 '16 at 04:12
  • Yes, it is R + pseudocode; you can ignore the function declaration. Actually, this is inside a PL/R function in Postgres, but I didn't want to put the focus on Postgres – strv7 Sep 16 '16 at 04:33
  • ...so how is pseudocode returning a result, correct or not? – alistaire Sep 16 '16 at 04:37
  • I have made some edits to my question; hopefully it is clear now. The first option returns a wrong number, whilst the second returns the correct prediction! I get no errors, however – strv7 Sep 16 '16 at 04:38
  • Better, but still not answerable without [a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – alistaire Sep 16 '16 at 04:41
  • I have added the function calls – strv7 Sep 16 '16 at 04:55
  • It is R inside PostgreSQL – strv7 Sep 16 '16 at 04:57

1 Answer


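The problem seems to be the use of newdata = as.data.frame.list(feat_vec). As discussed in your previous question, this returns ugly column names. When you call predict, however, newdata must have column names that match the covariate names in your model formula. If they don't, predict cannot find the covariates in newdata and looks them up elsewhere (in the example below it finds the 20 training values in the workspace), so your feature vector is ignored. You should get a warning message when you call predict.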

## example data
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)

## linear model
model <- lm(y ~ x1 + x2)

## new data
feat_vec <- c(0.4, 0.6)
newdat <- as.data.frame.list(feat_vec)
#  X0.4 X0.6
#1  0.4  0.6

## prediction
y_hat <- predict.lm(model, newdata = newdat)
#Warning message:
#'newdata' had 1 row but variables found have 20 rows 

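To see why that warning matters, here is a minimal sketch (same simulated data as above; `bad_newdat` and `bad_pred` are names made up for this illustration) that checks what predict actually returns when the column names do not match. It only illustrates a plain R session; what happens inside PL/R depends on which variables the unserialized model can still reach.

```r
## same example data and model as above
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

## new data whose column names (X0.4, X0.6) do not match x1, x2
bad_newdat <- data.frame(X0.4 = 0.4, X0.6 = 0.6)

## the warning is suppressed here only so the script runs quietly
bad_pred <- suppressWarnings(predict(model, newdata = bad_newdat))

## one value per training row (20 here), not a single prediction
length(bad_pred)

## TRUE: these are just the fitted values of the training data,
## i.e. the supplied feature vector was never used
isTRUE(all.equal(unname(bad_pred), unname(fitted(model))))
```

So the number coming back is not a prediction for feat_vec at all, which would be consistent with getting the same result no matter what you pass in.
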
What you need is

newdat <- as.data.frame.list(feat_vec,
                             col.names = attr(model$terms, "term.labels"))
#   x1  x2
#1 0.4 0.6

y_hat <- predict.lm(model, newdata = newdat)
#        1 
#0.5192413 

This is the same as what you can compute manually:

coef = model$coefficients
unname(coef[1] + sum(coef[-1] * feat_vec))
#[1] 0.5192413 
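
Putting this together, the R body of your PL/R function could look roughly like the sketch below. The helper name `predict_one` is hypothetical, `serialize(model, NULL)` merely stands in for the bytea argument, and the code assumes that every term in the formula is a plain numeric covariate listed in the same order as the entries of feat_vec. I have not run it inside Postgres.

```r
## hypothetical helper: what the R body of lm_predict might do
predict_one <- function(feat_vec, model_raw) {
  mdl <- unserialize(model_raw)
  ## assumption: term labels are plain covariate names, in feat_vec order
  labels <- attr(terms(mdl), "term.labels")
  stopifnot(length(labels) == length(feat_vec))
  ## as.numeric() drops the dim attribute of the 1-d array;
  ## the one-row matrix carries the covariate names as column names
  newdat <- as.data.frame(matrix(as.numeric(feat_vec), nrow = 1,
                                 dimnames = list(NULL, labels)))
  unname(predict(mdl, newdata = newdat))
}

## example data and model, as above
set.seed(0)
x1 <- runif(20)
x2 <- rnorm(20)
y <- 0.3 * x1 + 0.7 * x2 + rnorm(20, sd = 0.1)
model <- lm(y ~ x1 + x2)

model_raw <- serialize(model, NULL)  # raw vector, stand-in for bytea
feat_vec <- array(c(0.4, 0.6))       # 1-d array, as in the question

y_hat <- predict_one(feat_vec, model_raw)

## manual computation for comparison
manual <- unname(coef(model)[1] + sum(coef(model)[-1] * feat_vec))
isTRUE(all.equal(y_hat, manual))
```

If the formula contains interactions or transformed terms, the term labels are no longer the variable names, so the column names would have to be supplied some other way.
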
Zheyuan Li
  • I don't get to see warning messages when calling R from Postgres... But something is definitely wrong – strv7 Sep 16 '16 at 04:56
  • thank you for your answer. I really appreciate it. It is still not working for me though, y_hat returns always the same result while the "manual" computation returns correct predictions. I don't understand why :/ Why do I need to include the col.names?? Is that really important? – strv7 Sep 16 '16 at 06:39
  • This solved my issue when working with randomForest... thanks! I still get the weird behavior with lm but happy I got it to work with another regression model and exactly the same code! – strv7 Sep 16 '16 at 07:35