I have built some models using lm(). The response variable is the abundance of a species at one of two locations each month, given as a percentage to six decimal places. Percentages have to be used because the data were collected via citizen science: the actual total recorded each month is not reliable, but the overall proportion (%) at each of the two locations is.

The best-fit model has two explanatory variables, wind speed and wind direction, both numerical. I would like to apply the predict() function to it. So far, I have been able to do this for a simpler model by following the instructions from the post here, as shown below.

model <- lm(y~ x1, data=df)
new.df <- data.frame(x1=c(0, 10, 20))
predict(model, new.df) 

This seems to work well for models with just a single explanatory variable, but I am having trouble adding a second so that it works on my best-fit model.

So far, this is what I have come up with. However, the results do not make sense, as two of them are negative numbers.

model2 <- lm(y ~ x1+x2, data=df)
new.df <- data.frame(x1=c(1, 6, 12), x2=c(1, 10, 20))
predict(model2, new.df)

 1          2          3 
 0.4123114 -0.3975497 -1.3014379 

I would be grateful if anyone could offer any suggestions.

Jo Harris
  • The `lm` model does not know that your `y` cannot be negative, so for some plausible combination of `x` values it can predict a negative `y`. What you are most likely looking for is `glm` with `family = poisson`. – missuse Mar 30 '18 at 11:47
  • I can't see anything wrong with your code; it's hard to say more with you not providing details about `df` and/or the model fit. What is the quality of the fitted linear model? Please review how to provide a [minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data. – Maurits Evers Mar 30 '18 at 11:48
  • Can't really help without some sample of your input data, but @missuse is correct. `lm` appears to be working correctly. If negative values are impossible in your data, you'll need to specify a `glm` link function that is better suited to your data. – jdobres Mar 30 '18 at 11:49
  • Thank you for the feedback. I have updated my question and I hope it is sufficient. I began with `glm`, but I was having difficulty because the response variable is a percentage with six decimal places. I think I may have to look for a different method to predict how these explanatory variables may affect the response. – Jo Harris Mar 30 '18 at 12:20
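
As a rough illustration of the `glm` route discussed in the comments: one commonly used option for a response that is a proportion (a percentage divided by 100) is a quasi-binomial family with a logit link, which accepts non-integer values between 0 and 1 and keeps predictions inside that range. This is a swapped-in alternative to the `family = poisson` suggestion, not something confirmed by the asker's data. The data below are simulated, and the column name `p` and all coefficients are made up.

```r
# Simulated data only; 'p' is a hypothetical proportion in (0, 1),
# i.e. the percentage divided by 100.
set.seed(123)
n  <- 60
df <- data.frame(x1 = runif(n, 0, 20), x2 = runif(n, 0, 1))
df$p <- plogis(-0.5 + 0.08 * df$x1 - 1.2 * df$x2 + rnorm(n, sd = 0.3))

# Quasi-binomial GLM with a logit link for a proportion response
fit <- glm(p ~ x1 + x2, family = quasibinomial(link = "logit"), data = df)

# New predictor values; column names must match the model formula
new.df <- data.frame(x1 = c(1, 6, 12), x2 = c(0.1, 0.2, 0.3))

# type = "response" returns predictions on the proportion scale
pred <- predict(fit, newdata = new.df, type = "response")
print(pred)
```

Multiplying `pred` by 100 would give percentages again. Whether this family actually suits the data would still need checking against the usual diagnostics.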

2 Answers


If you need x1 + x2 and the interaction of both (y ~ x1 + x2 + x1:x2, which is what y ~ x1*x2 expands to), try this:

> df <- data.frame(x1=c(2, 12, 24), x2=c(2, 20, 40), y=c(1,2,3)) # example DF

> model2 <- lm(y ~ x1*x2, data=df)
> new.df <- data.frame(x1=c(1, 6, 12), x2=c(1, 10, 20))
> predict(model2, new.df)
  1   2   3 
1.0 1.5 2.0 
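
A related check that may be worth doing: predict() looks up the model's variables by name in the new data frame, so every predictor in the formula should appear there as a column with exactly that name. An argument to data.frame() that is wrapped in extra parentheses is passed unnamed and ends up with an auto-generated column name instead. A minimal sketch with simulated data (the coefficients and the helper variable `needed` are made up for illustration):

```r
# Simulated data purely for illustration
set.seed(1)
df <- data.frame(x1 = runif(30, 0, 25), x2 = runif(30, 0, 40))
df$y <- 1 + 0.2 * df$x1 - 0.1 * df$x2 + rnorm(30, sd = 0.5)
model2 <- lm(y ~ x1 + x2, data = df)

# Extra parentheses make the second argument unnamed:
# the column is not called "x2" (and x2 is assigned as a side effect)
bad <- data.frame(x1 = c(1, 6, 12), (x2 = c(1, 10, 20)))
print(names(bad))

# Properly named columns
new.df <- data.frame(x1 = c(1, 6, 12), x2 = c(1, 10, 20))

# Predictor names used by the model, without the response
needed <- all.vars(delete.response(terms(model2)))
missing_vars <- setdiff(needed, names(new.df))
stopifnot(length(missing_vars) == 0)

pred <- predict(model2, new.df)
print(pred)
```

When a predictor is missing from the new data frame, predict() can silently pick up a variable of the same name from the workspace, which is why the explicit name check can help.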
  • That's perfect, thank you. I am sorry to admit that I found a silly error in my script: the x2 values in my CSV are decimals, not whole numbers. I have now adjusted them to 0.1, 0.2 and so on, and it seems to be working. – Jo Harris Mar 30 '18 at 16:33

I found the problem. My response variable had been transformed to ensure assumptions were satisfied. Therefore, the output from predict() returned values in their transformed state.
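
To illustrate the idea: the transformation I used is not shown here, so the sketch below assumes a log transform purely as an example, with simulated data. Predictions come back on the transformed scale and need the inverse function (here exp()) to return to the original units.

```r
# Simulated data; the log transform is an assumption for illustration only
set.seed(42)
df <- data.frame(x1 = runif(40, 0, 20), x2 = runif(40, 0, 1))
df$y <- exp(0.5 + 0.05 * df$x1 - 0.8 * df$x2 + rnorm(40, sd = 0.2))

# Model fitted to the transformed response
fit <- lm(log(y) ~ x1 + x2, data = df)

new.df <- data.frame(x1 = c(1, 6, 12), x2 = c(0.1, 0.2, 0.3))

pred_log <- predict(fit, new.df)  # predictions on the log scale (can be negative)
pred_y   <- exp(pred_log)         # back-transformed to the original scale
print(pred_log)
print(pred_y)
```

For other transformations the matching inverse would be needed instead of exp(). Note that back-transforming a predicted mean on the log scale does not in general give the mean on the original scale, so the result is best read as an approximate central value.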

Jo Harris