Predict () based on a given x - many ppl with this problem, yet non of the answers work

Question

I'm trying to use a multiregression model to predict values based on a given x, I've seen that lots of people have had this same issue yet none of the answers given so far have worked for me.

My model is

M_PS_av <- glm.nb(PS_av ~ poly(Age_a,2) + Income_a + Education_a + GroupA_a + GroupB_a + GroupC_a + GroupD_a + GroupE_a, data = BCC_a)

I'm interested in the effects of age and specifically when the age peak is met, therefore I want to predict based upon age only.

So far I've tried

predict(M_PS_av, data.frame(Age_a = 15))
predict(M_PS_av, data.frame(Age_a=Age_a[15]))
predict(M_PS_av, newdata = new.ages)

where I created another dataframe but this didn't return what I was after

I've also tried giving values for the different variables, and using this as my data.frame:

df <- data.frame(Age_c=19,Income_a=1, Education_a=1, GroupA_a=1, GroupB_a=1, GroupC_a=1, GroupD_a=1, GroupEa=1)

I've also tried the poly with and without poly(..., raw=TRUE)

Yet I still get an error. This is the error I've gotten most of the time:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  variable lengths differ (found for 'Income_a')
In addition: Warning message:
'newdata' had 1 row but variables found have 1019 rows

Can anyone help?

Thanks!

If you want a model with multiple parameters to give you a prediction you can't just feed it a single parameter (age) and expect a result. That's like saying "I know the formula for the volume of a box is length x width x height, but I want to know what the volume is when the length is 30cm - why won't the formula give me the volume?" It _might_ be reasonable to take the _average_ of each of your other variables as inputs, but without knowing anything about your model, I can't say this with certainty — Allan Cameron, Jul 17 '20 at 16:59
Are some of your variables categorical (factors)? Just passing in "1" for all the variables only works if they are all numeric. In order to predict, you need to pass in an appropriate value for all your predictors in your model. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. — MrFlick, Jul 17 '20 at 17:00

Allan Cameron · Accepted Answer · 2020-07-17T19:04:55.897

The hardest part of this was trying to recreate your data structure so that we could have a working example for your code. Of course, the values and factor levels will be completely different from your own data, but it should be enough for a demonstration:

set.seed(69)

df <- data.frame(Education_a = factor(c("Private", "Public")),
                 GroupA_a = factor(c("A1", "A2")),
                 GroupB_a = factor(c("B1", "B2")),
                 GroupC_a = factor(c("C1", "C2")),
                 GroupD_a = factor(c("D1", "D2")),
                 GroupE_a = factor(c("E1", "E2")))

BCC_a          <- expand.grid(df)[rep(1:64, 20), ]
BCC_a$Age_a    <- round(rgamma(64 * 20, 15, 1))
BCC_a$Income_a <- rgamma(64 * 20, 15, 1/2000)
lambdas        <- apply(do.call(cbind, lapply(BCC_a[1:6], 
                                       function(x) runif(2, 0.5, 1.5)[as.numeric(x)]
                                )), 1, prod)
BCC_a$PS_av    <- rpois(nrow(BCC_a), 1 + lambdas/2 * BCC_a$Age_a^2 + 0.001 * BCC_a$Income_a)

Here I have assumed that age and income are numeric variables, whereas the groups are factor variables:

 head(BCC_a)
#>   Education_a GroupA_a GroupB_a GroupC_a GroupD_a GroupE_a Age_a Income_a PS_av
#> 1     Private       A1       B1       C1       D1       E1    15 30500.19   162
#> 2      Public       A1       B1       C1       D1       E1    16 41160.54   170
#> 3     Private       A2       B1       C1       D1       E1    13 43146.83   107
#> 4      Public       A2       B1       C1       D1       E1    18 33023.85   124
#> 5     Private       A1       B2       C1       D1       E1     8 31122.07    65
#> 6      Public       A1       B2       C1       D1       E1    21 26487.43   215

Now let's create your model:

library(MASS)
M_PS_av <- glm.nb(PS_av ~ poly(Age_a,2) + Income_a + Education_a + GroupA_a +
                          GroupB_a + GroupC_a + GroupD_a + GroupE_a, data = BCC_a)

And we can review it with summary(M_PS_av)

#> glm.nb(formula = PS_av ~ poly(Age_a, 2) + Income_a + Education_a + 
#>     GroupA_a + GroupB_a + GroupC_a + GroupD_a + GroupE_a, data = BCC_a, 
#>     init.theta = 814.4965099, link = log)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.4821  -0.6993  -0.0217   0.6828   4.1628  
#> 
#> Coefficients:
#>                     Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)        4.750e+00  1.273e-02 372.981  < 2e-16 ***
#> poly(Age_a, 2)1    1.309e+01  1.012e-01 129.326  < 2e-16 ***
#> poly(Age_a, 2)2   -1.077e+00  8.885e-02 -12.118  < 2e-16 ***
#> Income_a           8.215e-06  3.486e-07  23.565  < 2e-16 ***
#> Education_aPublic -1.487e-01  5.464e-03 -27.218  < 2e-16 ***
#> GroupA_aA2        -3.534e-01  5.523e-03 -63.989  < 2e-16 ***
#> GroupB_aB2        -2.518e-02  5.481e-03  -4.593 4.37e-06 ***
#> GroupC_aC2         7.447e-02  5.445e-03  13.676  < 2e-16 ***
#> GroupD_aD2        -3.102e-02  5.442e-03  -5.701 1.19e-08 ***
#> GroupE_aE2        -4.514e-02  5.446e-03  -8.289  < 2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for Negative Binomial(814.4965) family taken to be 1)
#> 
#>     Null deviance: 26983  on 1279  degrees of freedom
#> Residual deviance:  1345  on 1270  degrees of freedom
#> AIC: 9952.3
#> 
#> Number of Fisher Scoring iterations: 1
#> 
#>               Theta:  814 
#>           Std. Err.:  234 
#>  2 x log-likelihood:  -9930.252

Now, to use predict we need a data frame of the predictors set to the levels we want to examine. Note we need all predictors, and if there are factor variables, we need to give the named factor levels:

new_data <- data.frame(Age_a = 15, Income_a = mean(BCC_a$Income_a), 
                       Education_a = "Private", GroupA_a = "A1", GroupB_a = "B1", 
                       GroupC_a = "C1", GroupD_a = "D1", GroupE_a = "E1")

Now we just plug this into predict. Note that we need to use type = "response" to get the actual expected value of the outcome variable (otherwise we will get the natural log of the expected value):

 predict(M_PS_av, newdata = new_data, type = "response")
#>        1 
#> 153.0262

This looks correct for the data I have put in.

Brilliant, this worked. My issue was in the way I was setting up the data frame and that I wasn't using type = "response". Thank you! — Orla, Jul 18 '20 at 06:59
I'm glad I could help @Orla. If this has solved your problem please consider marking it as accepted to help others with similar issues find a solution. — Allan Cameron, Jul 18 '20 at 07:15
Hi @Allan I tried marking your post as useful (as it is!) but I'm being told that I don't have enough reputation points, probably because I'm pretty new to the site. When I've improved my rep I'll come back and mark it! — Orla, Jul 29 '20 at 06:09
@Orla you have enough rep to _accept_ the answer - just click on the tick next to the answer. Thanks — Allan Cameron, Jul 29 '20 at 06:35

Predict () based on a given x - many ppl with this problem, yet non of the answers work

1 Answers1