
I have some x and y values which can be fitted nicely with a polynomial

> mysubx
[1]  0.05  0.10  0.20  0.50  1.00  2.00  5.00
[8]  9.00 12.30 18.30
> mysuby
[1] 1.008 1.019 1.039 1.091 1.165 1.258 1.402
[8] 1.447 1.421 1.278
> mymodel <- lm(mysuby ~ poly(mysubx,5))

The fit can be confirmed graphically.

> plot(mysubx, mysuby)
> lines(mysubx, mymodel$fitted.values, col = "red")

Plot showing original data in black and fitted data in red

My problem happens when I try to use the coefficients returned by lm to determine a y value from a given x. So for example, if I use the first value in mysubx, this should give mymodel$fitted.values[1]. From the graph I should expect a number around 1.01.

> ansx = 0
> for(i in seq_along(mymodel$coefficients)){
+ ansx = ansx + mysubx[1]^(i-1)*mymodel$coefficients[[i]]
+ }
> ansx
[1] 1.229575
> 

Where

> mysubx[1]
[1] 0.05
> mymodel$coefficients
 (Intercept) poly(mysubx, 5)1 poly(mysubx, 5)2 poly(mysubx, 5)3 
  1.21280000       0.35310369      -0.35739878       0.10989141 
 poly(mysubx, 5)4 poly(mysubx, 5)5 
 -0.04608682       0.02054430 

As can be seen, an x value on the graph of 0.05 does not give 1.229575. Obviously I don't understand what is going on. Can someone explain how I can get the correct y value from any given x value using the output of the lm function? Thank you.

Linda Marsh

2 Answers


In fact, what you want is not poly(mysubx, 5) but

poly(mysubx, 5, raw = TRUE)

If you leave raw as FALSE, it does not use x, x^2, x^3, etc. but orthogonal polynomials.

mymodel <- lm(mysuby ~ poly(mysubx, 5, raw = TRUE))
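
With raw = TRUE the coefficients belong to the ordinary polynomial 1, x, x^2, ..., x^5, so the loop from the question evaluates the fitted curve directly. A minimal sketch (mymodel_raw is just an illustrative name, not from the original post):

# Refit with raw polynomial terms so the coefficients match powers of x.
mymodel_raw <- lm(mysuby ~ poly(mysubx, 5, raw = TRUE))

ansx <- 0
for (i in seq_along(mymodel_raw$coefficients)) {
  # coefficient i multiplies x^(i-1): intercept, x, x^2, ..., x^5
  ansx <- ansx + mysubx[1]^(i - 1) * mymodel_raw$coefficients[[i]]
}
ansx  # roughly 1.0107, matching mymodel_raw$fitted.values[1]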
Pop
  • If I understand correctly, raw = TRUE will work perfectly fine for smooth data like my example, but noisy data will be better handled using the raw = FALSE option? – Linda Marsh Mar 19 '18 at 22:24

When you fit a model, R first builds a model matrix from your data and your formula. You can get hold of it using the model.matrix function.

> X <- model.matrix(mysuby ~ poly(mysubx,5))

This matrix has a row per input point (in your case your input is one-dimensional and kept in mysubx, but in general, you will get it from a data frame and it can be multi-dimensional). The formula specifies how the input data should be modified before we fit the model. We can take a closer look at the first row:

> X[1,]
     (Intercept) poly(mysubx, 5)1 poly(mysubx, 5)2 
       1.0000000       -0.2517616        0.2038351 
poly(mysubx, 5)3 poly(mysubx, 5)4 poly(mysubx, 5)5 
      -0.2264003        0.2355258       -0.2245773 

As you can see, when you fit a polynomial, you get values for the intercept (always 1, since the intercept is a constant for the model; it doesn't depend on x) and the transformations you do on your input. We call such a row the "features" you use in your model.

In this case, you have a 1->N dimensional mapping from input to features. In general, it will be an M -> N-dimensional mapping. Regardless of how you map input to the model matrix, the model fitting only cares about the model matrix. The model builds a way to map each row in this matrix to a prediction.
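You can check this mapping directly for the example data (a quick sketch; X is the model matrix built above):

dim(X)       # 10 x 6: one row per observation, intercept plus five polynomial columns
colnames(X)  # the feature names, matching the coefficient names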

For a linear model, the mapping from features to target variable is an inner product. You take the coefficients and compute the inner product with the features. So, for your first data point, you do:

> mymodel$coefficients %*% X[1,]
     [,1]
[1,] 1.010704

For the entire data, you simply do this for each row:

> predict(mymodel)
       1        2        3        4        5        6        7 
1.010704 1.020083 1.038284 1.088659 1.159883 1.263722 1.400163 
       8        9       10 
1.447700 1.420790 1.278011 
> apply(X, MARGIN = 1, function(features) mymodel$coefficients %*% features)
       1        2        3        4        5        6        7 
1.010704 1.020083 1.038284 1.088659 1.159883 1.263722 1.400163 
       8        9       10 
1.447700 1.420790 1.278011 

Here, X doesn't have to be the data you trained the model on. You can build it from any other input data using the same formula. I would recommend not using global variables in your formulae, though, as this is likely to cause problems later on.
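The usual way to do this for new x values is to fit the model from a data frame and let predict() rebuild the features via newdata; a minimal sketch (the names dat, fit and newx are only for illustration):

# Fit from a data frame so predict() can reapply the poly() transformation
# to any new inputs.
dat <- data.frame(x = mysubx, y = mysuby)
fit <- lm(y ~ poly(x, 5), data = dat)

# Predictions for new x values, on the same scale as the original y.
newx <- data.frame(x = c(0.05, 3, 10))
predict(fit, newdata = newx)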

Thomas Mailund