
I'm trying to fit a separate logistic model to the data for each county, and I'd like it all together in one data frame at the end (all counties, all predicted populations, for the specified years).

Here's the data:

county <- structure(list(name = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 
8L, 9L, 9L, 9L, 9L, 9L), .Label = c("Alachua", "Columbia", "Gilchrist", 
"Lake", "Levy", "Marion", "Orange", "Seminole", "Volusia"), class = 
"factor"), 
year = c(1920L, 1940L, 1970L, 1990L, 2010L, 1920L, 1940L, 
1970L, 1990L, 2010L, 1920L, 1940L, 1970L, 1990L, 2010L, 1920L, 
1940L, 1970L, 1990L, 2010L, 1920L, 1940L, 1970L, 1990L, 2010L, 
1920L, 1940L, 1970L, 1990L, 2010L, 1920L, 1940L, 1970L, 1990L, 
2010L, 1920L, 1940L, 1970L, 1990L, 2010L, 1920L, 1940L, 1970L, 
1990L, 2010L), pop = c(24662.84498, 38518.67335, 105080.0739, 
182378.0527, 247964.4355, 14353.67655, 16988.63031, 25423.53768, 
42636.12851, 67396.52047, 6955.297482, 4331.7027, 3661.621676, 
9835.709676, 16780.95117, 12812.1731, 27202.15681, 65668.28125, 
153585.2153, 297441.8053, 10034.20186, 12707.52359, 12911.58508, 
26370.47373, 41650.51535, 23990.09377, 31340.67059, 69056.41468, 
194358.0547, 334117.7792, 19825.73528, 68559.76913, 337259.2307, 
670422.46, 1140314.083, 11027.52715, 23881.62063, 91628.11201, 
298115.877, 438079.7446, 24526.72497, 55775.68449, 175004.8787, 
382885.1367, 516049.0225)), .Names = c("name", "year", "pop"
), row.names = c(NA, -45L), class = "data.frame")

and here's what I've ended up with:

library(dplyr) 
county %>% 
    group_by(name) %>%
    (function(x) {
            fm<- nls(pop ~ SSlogis(year, phi1, phi2, phi3), data = x)
            timevalues <- c(1992, 2002, 2007, 2012)
            predict <- predict(fm,list(year=timevalues))
            cbind(predict, predict)
    })

but this only gives me four predicted values:

out:
  predict  predict
[1,] 226713.5 226713.5
[2,] 293596.4 293596.4
[3,] 326455.5 326455.5
[4,] 357640.8 357640.8

with no clue as to which county they are for. If I run this code separately for one county (without the group_by()), I can get it to work, but then I have to repeat it for each county and bind everything together myself, which is going to get tedious once I'm working with more than 9 counties.
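
For example, the one-county version that does work looks roughly like this (just a sketch of the tedious approach, using Alachua as the example county):

# One county at a time:
alachua <- subset(county, name == "Alachua")
fm <- nls(pop ~ SSlogis(year, phi1, phi2, phi3), data = alachua)
timevalues <- c(1992, 2002, 2007, 2012)
pred <- predict(fm, list(year = timevalues))
# ...then repeat for every other county and rbind() the results by hand.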

  • I don't use dplyr, but you can probably do this with `... %>% do({fm <- ...; data.frame(predict(...))})` – rawr Jun 04 '18 at 03:14
  • give a name to your anonymous function and then call it with 'do' as in [this question](https://stackoverflow.com/questions/28447608/dplyr-from-group-by-into-own-function) – Esther Jun 04 '18 at 03:50

1 Answer


As @Esther suggests in the comments, a good first step would be to extract your anonymous prediction function into a named one. It would also make sense to make the function accept the prediction years as an argument, rather than fixing them inside the function:

predict_pop <- function(data, year) {
  model <- nls(pop ~ SSlogis(year, phi1, phi2, phi3), data = data)

  nd <- data.frame(year)
  pred <- predict(model, nd)

  cbind(nd, pred)
}

Let's just check that this works with the full data:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

years <- c(1992, 2002, 2007, 2012)
predict_pop(county, years)
#>   year     pred
#> 1 1992 226713.5
#> 2 2002 293596.4
#> 3 2007 326455.5
#> 4 2012 357640.8

Great! Now one way (as suggested by @eipi10 in the comments) to fit the model for each county would be to first split() the data into a list of data frames, one per county, and then use lapply() to get predictions for each subset.

split(county, county$name) %>%
  lapply(predict_pop, years)
#> Error in nls(y ~ 1/(1 + exp((xmid - x)/scal)), data = xy, start = list(xmid = aux[[1L]], : step factor 0.000488281 reduced below 'minFactor' of 0.000976562

However, this results in an error: it seems the model can't be fitted for some counties on their own. That is something you'll probably need to address in the model itself; but if we want predictions for the counties where the model does fit, we can modify the prediction function to handle the cases where it doesn't.

One way to do this is to use purrr::safely() to make a "safe" version of nls() that doesn't stop everything on an error, but instead always returns a two-element list: result, which contains the return value if the call succeeded and NULL if it errored; and error, which contains the error object if one occurred and NULL otherwise.
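
As a minimal illustration of what safely() returns (using log() here purely as an example, unrelated to the model):

safe_log <- purrr::safely(log)
str(safe_log(10))     # $result holds the usual return value; $error is NULL
str(safe_log("oops")) # $result is NULL; $error holds the caught error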

With the safe modelling function, we can then check if the model could be fitted, and if not, return NA as the predictions rather than erroring. Here's a modified version of the prediction function that does just that:

predict_pop <- function(data, year) {
  safe_nls <- function(...) purrr::safely(nls)(...)$result
  model <- safe_nls(pop ~ SSlogis(year, phi1, phi2, phi3), data = data)

  nd <- data.frame(year)
  pred <- NA_real_

  if (!is.null(model))
    pred <- predict(model, nd)

  cbind(nd, pred)
}
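
As a quick sanity check, calling the modified function on a county where the fit fails (Columbia turns out to be one, as we'll see below) should now return NA predictions instead of throwing an error:

predict_pop(subset(county, name == "Columbia"), years)
# should give the four years with NA in the pred column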

Now we can use the technique from before to get the predictions. I added a bind_rows() call to combine the list of results into a single data frame:

split(county, county$name) %>%
  lapply(predict_pop, years) %>% 
  bind_rows(.id = "county") %>% 
  head()
#>     county year     pred
#> 1  Alachua 1992 186020.6
#> 2  Alachua 2002 222332.3
#> 3  Alachua 2007 239432.0
#> 4  Alachua 2012 255440.9
#> 5 Columbia 1992       NA
#> 6 Columbia 2002       NA

Here we can see the missing predictions for Columbia, one of the counties that the model fitting fails for.
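
If you want a quick list of the counties the model couldn't be fitted for, one option (assuming you store the combined result, here as preds) is to filter for the missing predictions:

preds <- split(county, county$name) %>%
  lapply(predict_pop, years) %>% 
  bind_rows(.id = "county")

preds %>% 
  filter(is.na(pred)) %>% 
  distinct(county)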

There are also a couple of other ways to predict for each county. One such alternative, mentioned in the comments by both @rawr and @Esther, is to use do():

county %>% 
  group_by(name) %>% 
  do(predict_pop(., years)) %>% 
  head()
#> # A tibble: 6 x 3
#> # Groups:   name [2]
#>   name      year    pred
#>   <fct>    <dbl>   <dbl>
#> 1 Alachua   1992 186021.
#> 2 Alachua   2002 222332.
#> 3 Alachua   2007 239432.
#> 4 Alachua   2012 255441.
#> 5 Columbia  1992     NA 
#> 6 Columbia  2002     NA

Another way would be to create a "nested" data frame by collapsing each county's data into a list column with tidyr::nest(). We can then use lapply() to fit the models and get predictions for each subset, and finally tidyr::unnest() to expand the predictions out of the list column.

county %>% 
  tidyr::nest(-name) %>% 
  tidyr::unnest(lapply(data, predict_pop, years)) %>% 
  head()
#>       name year     pred
#> 1  Alachua 1992 186020.6
#> 2  Alachua 2002 222332.3
#> 3  Alachua 2007 239432.0
#> 4  Alachua 2012 255440.9
#> 5 Columbia 1992       NA
#> 6 Columbia 2002       NA

And there we have it: a whole host of techniques for handling many models. For further discussion and examples of this, you might be interested in the many models chapter in the R for Data Science book.
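
For reference, the list-column workflow described in that chapter would look roughly like this with the function above (a sketch, assuming purrr is available; it is essentially the nest() approach from before with purrr::map() inside mutate()):

library(purrr)

county %>% 
  tidyr::nest(-name) %>% 
  mutate(pred = map(data, predict_pop, years)) %>% 
  tidyr::unnest(pred)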

Created on 2018-06-04 by the reprex package (v0.2.0).

Mikko Marttila