Combine predicted value of probit model into original dataframe

Question

I have the data like this:

    df <- tibble::tibble(
        id = rep(c(1:50), each = 5),
        y = runif(250,min = 0, max = 1),
        x1 = rnorm(250, mean = 0, sd=1),
        x2 = rnorm(250, mean = 0, sd=1),
        x3 = rnorm(250, mean = 0, sd=1),
        x4 = rnorm(250, mean = 0, sd=1), 
        x5 = rnorm(250, mean = 0, sd=1), 
) %>% 
        group_by(id) %>% 
        mutate(year = rep(c(2001:2005)))

I would like to estimate the probit model for every year to get (1)coefficient estimates,and (2) predicted value of y, and (3) number of observations used to estimate the model:

probit_model <- function(df) {
        glm (y ~ x1 + x2 + x3 + x4+ x5,
            family = binomial(link = "probit"),
            data = df)
}

Do you know how we can get the coefficient estimates, predicted value for every year and then combine them with the original data (that is df) here? I know what we can do with OLS model (by using map function to estimate for multiple models). But I do not know how to do with probit regression.

Thank you so much.

*"I know what we can do with OLS model (by using map function to estimate for multiple models)"*. If you've got code that works with `lm`, I would think basically the same code would work for your probit model. You may need to add the argument `type = "response"` to `predict()`, if you're using `predict()`, but pretty much everything else should be the same. — Gregor Thomas, Jan 19 '22 at 21:26
with ```lm``` they use add_predictions, but I don't know how to modify it with probit:(. And I've only got code to get coefficients and residual (by ``` mutate(resids = map2(data, model, add_residuals))``` like that). — dungnttd, Jan 19 '22 at 21:31
Are you aware that it's (probably) a mistake to use a probit model unless you can identify the denominators (via the `weights=` argument) ? (Maybe this is just a feature of your data, not of your actual data/model ...) — Ben Bolker, Jan 20 '22 at 01:00

score 3 · Answer 1 · answered Jan 19 '22 at 23:50

I think you need to do this, I used this post as reference.

library(dplyr)

df <- tibble::tibble(
  id = rep(c(1:50), each = 5),
  y = runif(250,min = 0, max = 1),
  x1 = rnorm(250, mean = 0, sd=1),
  x2 = rnorm(250, mean = 0, sd=1),
  x3 = rnorm(250, mean = 0, sd=1),
  x4 = rnorm(250, mean = 0, sd=1), 
  x5 = rnorm(250, mean = 0, sd=1), 
) %>% 
  group_by(id) %>% 
  mutate(year = rep(c(2001:2005)))


fitted_models = df %>% group_by(year) %>% do(model = glm(y ~ x1 + x2 + x3 + x4+ x5,
                                                         family = binomial(link = "probit"),
                                                         data = .))

#fitted_models$year
#fitted_models$model[1]

fitted_models %>% summarise(broom::tidy(model))

## A tibble: 30 x 5
#term        estimate std.error statistic p.value
#<chr>          <dbl>     <dbl>     <dbl>   <dbl>
#  1 (Intercept)  -0.160      0.187    -0.856   0.392
#2 x1            0.0860     0.230     0.375   0.708
#3 x2            0.0657     0.187     0.351   0.725
#4 x3            0.0472     0.160     0.296   0.767
#5 x4            0.216      0.191     1.13    0.257
#6 x5           -0.159      0.263    -0.604   0.546
#7 (Intercept)  -0.0792     0.182    -0.434   0.664
#8 x1            0.0314     0.170     0.185   0.853
#9 x2           -0.0320     0.194    -0.164   0.869
#10 x3            0.167      0.218     0.763   0.445
## ... with 20 more rows

fitted_models %>% summarise(broom::glance(model))

## A tibble: 5 x 8
#null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
#<dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#  1          21.7      49  -32.5  77.0  88.5     19.7          44    50
#2          16.4      49  -33.4  78.8  90.3     15.7          44    50
#3          15.5      49  -34.5  81.1  92.5     15.2          44    50
#4          16.6      49  -32.4  76.7  88.2     15.0          44    50
#5          19.6      49  -33.3  78.6  90.0     19.1          44    50

fitted_models %>% summarise(broom::augment(model, type.predict = "response"))

## A tibble: 250 x 12
#y      x1      x2     x3      x4      x5 .fitted .resid .std.resid   .hat .sigma .cooksd
#<dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>   <dbl>
#  1 0.819   0.0246  0.0176  0.280  0.192   0.840    0.407  0.846      0.875 0.0665  0.664 0.00894
#2 0.0418  1.41    0.297   1.15  -1.41    0.347    0.372 -0.792     -0.853 0.137   0.665 0.0144 
#3 0.119  -0.265  -0.158  -1.37  -2.48   -0.504    0.237 -0.300     -0.327 0.156   0.676 0.00284
#4 0.0282 -0.836  -0.442  -1.63   0.506   0.910    0.355 -0.808     -0.858 0.114   0.665 0.0112 
#5 0.893  -0.481  -0.384  -0.974  0.897  -0.662    0.510  0.819      0.850 0.0703  0.665 0.00792
#6 0.865   0.417  -0.0233  0.841 -0.268  -0.140    0.451  0.865      0.883 0.0395  0.664 0.00494
#7 0.809   1.30   -0.469   1.01  -0.0913 -0.106    0.486  0.669      0.702 0.0921  0.669 0.00778
#8 0.0220  0.119  -0.580  -0.533 -1.09    0.0142   0.326 -0.780     -0.801 0.0522  0.666 0.00406
#9 0.722   0.194  -1.50   -0.395  1.65   -0.868    0.592  0.271      0.297 0.167   0.676 0.00281
#10 0.131   1.24    0.600   1.14  -1.17    0.370    0.392 -0.579     -0.618 0.122   0.671 0.00756
## ... with 240 more rows

I try with ```broom::augment``` as you suggested, but it didn't keep the "id" column. How can we merge with the original dataframe "df"? — dungnttd, Jan 26 '22 at 21:26

score 2 · Answer 2 · answered Jan 20 '22 at 02:42

2

A similar answer to @cdcarrion's, from the same post, but using map (a slightly newer approach than do()):

fit the models

library(broom)
models <- (df
  %>% group_by(year)
  %>% nest()
  %>% mutate(model = map(data, glm,
                         formula = y ~ x1 + x2 + x3 + x4+ x5,
                         family = binomial(link = "probit")))
)

get coefficients

coefs <- (models
  %>% mutate(cc = map(model, tidy))
  %>% select(year, cc)
  %>% unnest(cols = cc)
)

get predictions

preds <- (models
  %>% mutate(aug = map(model, augment, type.predict = "response"))
  %>% select(year, aug)
  %>% unnest(cols = aug)
  %>% select(year:x5, .fitted)
)

answered Jan 20 '22 at 02:42

Ben Bolker

211,554
25
370
453

I try with ```preds``` as you suggested, but it didn't keep the "id" column. How can we merge with the original dataframe "df"? – dungnttd Jan 26 '22 at 21:26
You can use `full_join()`, or it may work to use `data = df` as an additional argument within the `map()` call. – Ben Bolker Jan 26 '22 at 23:36

Combine predicted value of probit model into original dataframe

2 Answers2

fit the models

get coefficients

get predictions

Linked