
This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).

I'm trying to build models for each country and generate linear regression predictions using the leave-one-out approach. In other words, in the code below, the "data" used when building model1 and model2 should not be the entire data set but only a subset of it (one country). Each country's data should be evaluated with a model built from that country's data alone.

The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?

library(modelr)
library(dplyr)   # for %>%, nest_by(), bind_cols(), bind_rows()
library(purrr)   # for map(), map_dfc()
install.packages("gapminder")
library(gapminder)
data(gapminder)

#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)

models <- list(fit_model1 = model1,fit_model2 = model2)

gapminder %>% nest_by(continent, country) %>%
  bind_cols(
    map(1:nrow(gapminder), function(i) {
      map_dfc(models, function(model) {
        training <- data[-i, ] 
        fit <- lm(model, data = training)
        
        validation <- data[i, ]
        predict(fit, newdata = validation)
        
      })
    }) %>%
      bind_rows()
  )
 
user3670179
  • What exactly does `subset=country` do? – jay.sf Dec 24 '20 at 16:39
  • model1 <- lm(lifeExp ~ pop, data = gapminder, subset = (country == "USA")), model1 <- lm(lifeExp ~ pop, data = gapminder, subset = (country == "FRANCE"))....etc – user3670179 Dec 24 '20 at 16:45
  • I simply put that to illustrate what I'm trying to achieve. I'm trying to subset by continent/country without having to create multiple models – user3670179 Dec 24 '20 at 16:47
  • I understand, however `lm(lifeExp ~ pop, data=gapminder, subset=country)` literally works and is different from `lm(lifeExp ~ pop, data=gapminder)`, that's actually weird. – jay.sf Dec 24 '20 at 16:48
  • Anyway, please add the expected output to your question. – jay.sf Dec 24 '20 at 17:15
  • In addition, you're talking about a "leave-one-out" approach while apparently attempting to calculate a model for each country. Could you elaborate on that? – jay.sf Dec 24 '20 at 17:35
  • I'm trying to do a leave one out within each country subset. – user3670179 Dec 24 '20 at 17:49
  • Good. And what about the output: what do you expect it to look like, could you show that? – jay.sf Dec 24 '20 at 17:57
  • I simply need the estimate of each model added at the end of the original dataset for each group of countries. I've replaced group_by by nest_by but I'm getting an error – user3670179 Dec 24 '20 at 18:02
  • I noticed in your question history that you're on a painful journey to find an answer for your LOO issue, and that you've probably gotten confused by different coding styles. I came up with a solution which is based on [this Q&A](https://stackoverflow.com/a/55670592/6574038) which you probably know, as well as the code snippets you've thrown in yourself from time to time. I hope that I have understood your requirements correctly and the results are the ones you want. – jay.sf Dec 24 '20 at 20:23

1 Answer


The most succinct and straightforward solution would be a nested for loop, where the outer loop runs over the two model formulae and the inner loop over the unit we want to leave out. The same can also be done with outer, which I show afterwards.

For the sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). Later I show how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. Its "Species" column corresponds to the "countries" in your data.

Part I

First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.

FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)
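
With the gapminder data from your question, this list would simply hold your two models instead; a sketch of the translation (the name FOAE_gap is only illustrative):

FOAE_gap <- list(fit1=lifeExp ~ pop, 
                 fit2=lifeExp ~ pop + gdpPercap)

Everything below stays the same, with country playing the role of Species.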

For the loop, we initialize a matrix im whose rows correspond to the rows we want to leave out and whose columns correspond to the model formulae.

im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE), 
             dimnames=list(NULL, names(FOAE)))

This would look like this:

head(im, n=3)
#      fit1 fit2
# [1,]   NA   NA
# [2,]   NA   NA
# [3,]   NA   NA

Now we loop over formulas and rows as described above.

for (i in seq(FOAE)) {
  for(j in seq(nrow(iris))) {
    train <- iris[-j,]  
    test <- iris[j,]    
    fit <- lm(FOAE[[i]], data=train)
    im[j, i] <- predict(fit, newdata=test)
  }
}

im has now been filled, and we may cbind it to the original iris data set to get our result res1.

res1 <- cbind(iris, im)
head(res1)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

To follow the outer approach instead, we put the code inside the for loop into a function, which we Vectorize so that it can handle the whole vectors of rows and formulae that outer passes to it.

FUN1 <- Vectorize(function(x, y) {
  train <- iris[-x,]
  test <- iris[x,]
  fit <- lm(y, data=train)
  predict(fit, newdata=test)
})

Now we put the rows 1:nrow(iris) to leave out one by one, the formula list FOAE, and FUN1 into outer(). This already gives us the result, which we can cbind to iris in the same way as above to get our result res2.

o1 <- outer(1:nrow(iris), FOAE, FUN1)
res2 <- cbind(iris, o1)

head(res2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

## check that both approaches give identical results
stopifnot(all.equal(res1, res2))

Part II

We can follow a similar approach when leaving out a cluster (i.e. species, or countries in your case). I show the outer method here. What changes is that we now leave out all observations belonging to a specific cluster, here "Species" (in your case "country"), whose unique values we put into a vector Species.u. Since the values are in "character" or "factor" format, we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values within each cluster, but we want the same value for every row of a cluster, we use a statistic, e.g. the mean prediction of each cluster. We set rownames according to the cluster.

FUN2 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)

o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)

This gives us a matrix which is smaller than our data set. Thanks to the rownames we can match the predictions to the clusters to which they belong.

o2
#                fit1     fit2
# setosa     3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica  4.911009 5.976922

res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])

head(res3)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa            5.1         3.5          1.4         0.2  setosa 3.609943 2.662609
# setosa.1          4.9         3.0          1.4         0.2  setosa 3.609943 2.662609
# setosa.2          4.7         3.2          1.3         0.2  setosa 3.609943 2.662609
# setosa.3          4.6         3.1          1.5         0.2  setosa 3.609943 2.662609
# setosa.4          5.0         3.6          1.4         0.2  setosa 3.609943 2.662609
# setosa.5          5.4         3.9          1.7         0.4  setosa 3.609943 2.662609

tail(res3)
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species     fit1     fit2
# virginica.44          6.7         3.3          5.7         2.5 virginica 4.911009 5.976922
# virginica.45          6.7         3.0          5.2         2.3 virginica 4.911009 5.976922
# virginica.46          6.3         2.5          5.0         1.9 virginica 4.911009 5.976922
# virginica.47          6.5         3.0          5.2         2.0 virginica 4.911009 5.976922
# virginica.48          6.2         3.4          5.4         2.3 virginica 4.911009 5.976922
# virginica.49          5.9         3.0          5.1         1.8 virginica 4.911009 5.976922

Edit

In this version of FUN2, called FUN3, the individual predictions of the models within each cluster are kept and row-bound (in two columns, of course, because there are two models) instead of averaged.

FUN3 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  predict(fit, newdata=test)
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)

o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)

res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 3.706940 2.678255
# setosa.2          4.9         3.0          1.4         0.2  setosa 3.500562 2.547587
# setosa.3          4.7         3.2          1.3         0.2  setosa 3.294183 2.416919
# setosa.4          4.6         3.1          1.5         0.2  setosa 3.190994 2.351586
# setosa.5          5.0         3.6          1.4         0.2  setosa 3.603751 2.612921
# setosa.6          5.4         3.9          1.7         0.4  setosa 4.016508 3.073249

Edit 2

As I learned from your comment, you first want a subset of your data along clusters; this is ss in FUN4 below. Then ss itself is subset by leaving out one row z at a time over the rows of ss.

FUN4 <- Vectorize(function(x, y) {
  ## subsets first by cluster then by row
  ss <- iris[iris$Species %in% x,]  ## cluster subset
  sapply(1:nrow(ss), function(z) {  ## subset rows using `sapply`
    train <- ss[-z,]  ## train data w/o row z
    test <- ss[z,]    ## test data for `predict`, just row z
    fit <- lm(y, data=train)
    predict(fit, newdata=test)
  })
}, SIMPLIFY=F)

## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)

## unique cluster names
Species.u <- unique(iris$Species)

## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)

## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))

## result
head(res4)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 1.476004 1.451029
# setosa.2          4.9         3.0          1.4         0.2  setosa 1.449120 1.431737
# setosa.3          4.7         3.2          1.3         0.2  setosa 1.426185 1.416492
# setosa.4          4.6         3.1          1.5         0.2  setosa 1.404040 1.398103
# setosa.5          5.0         3.6          1.4         0.2  setosa 1.462460 1.441295
# setosa.6          5.4         3.9          1.7         0.4  setosa 1.504990 1.559045
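
Applied to the gapminder data from your question, this Edit 2 pattern would look roughly like the sketch below (untested here; the names FUN4g, FOAE_gap, country.u and gap are only illustrative, and, like the iris example, it assumes the rows of the data are grouped by the cluster column, country, so that the stacked predictions line up with the original rows).

library(gapminder)
gap <- as.data.frame(gapminder)

FUN4g <- Vectorize(function(x, y) {
  ss <- gap[gap$country %in% x,]      ## cluster subset: all rows of one country
  sapply(1:nrow(ss), function(z) {    ## leave-one-out within that country
    fit <- lm(y, data=ss[-z,])        ## train without row z
    predict(fit, newdata=ss[z,])      ## predict the left-out row z
  })
}, SIMPLIFY=F)

## the two models from the question
FOAE_gap <- list(fit1=lifeExp ~ pop,
                 fit2=lifeExp ~ pop + gdpPercap)

## unique cluster names
country.u <- unique(gap$country)

o5 <- `rownames<-`(outer(country.u, FOAE_gap, FUN4g), country.u)
res5 <- cbind(gap, apply(o5, 2, unlist))  ## gapminder rows are ordered by country, so this lines up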
jay.sf
  • Unfortunately that doesn't really help. Part 1 and part 2 use the entire iris dataset as the "data" used to build the linear regression. I only want to use each subset of species to build the models that would be used to calculate the fit for each of those species. In other words to calculate the fit for a setosa record, the model to be used would have to be built **ONLY** using setosa data and not the entire iris dataset. This issue can be resolved by using nest_by(species). Unfortunately I'm getting an error when doing so. – user3670179 Dec 26 '20 at 17:32
  • Another option would be to restrict train and test to the specific species data. something like: train <- iris[-j,] %>% group_by(species) test <- iris[j,] %>% group_by(species) – user3670179 Dec 26 '20 at 17:34
  • Or maybe add a dynamic subset to the model. Something like that: fit <- lm(y, data=train, subset = (species [x]). Obviously this doesn't work either – user3670179 Dec 26 '20 at 17:36
  • @user3670179 You are probably wrong, you should study the code I gave you better. What do you think e.g. `iris[-x,]` and `iris[!iris$Species %in% x,]` would do other than subset the `iris` data? Compare `?subset` and `?Extract` by typing them into the console. – jay.sf Dec 26 '20 at 17:43
  • Part III is taking averages which is not what I'm looking for. Each row has to be evaluated individually. I tried to remove mean but that gives me an error. – user3670179 Dec 26 '20 at 17:52
  • train <- iris[-j,] means that, with the exception of row j, all the data in the iris dataset are being used. Which is not what I want – user3670179 Dec 26 '20 at 17:54
  • @user3670179 Part I leaves out one row at each iteration, Part II leaves out one cluster at each iteration. Part II is what you wanted. Part II calculates a train model on `iris[!iris$Species %in% x,]` and evaluates it on a test model `iris[iris$Species %in% x,]`. And I wrote, "Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the `mean` prediction of each cluster.". I have now added `FUN3` which in principle `rbind`s the results for each cluster, perhaps that's what you wanted. – jay.sf Dec 26 '20 at 18:18
  • "Part II leaves out one cluster". **That's not** what I'm trying to achieve. I'm trying to only have one cluster considered (e.g. setosa) and within that cluster leave out the first row in the train dataset. When the record is virginica, only keep the virginica cluster, create the train dataset with that cluster (using leave one out) create the test dataset using that cluster and evaluate the fit for all the ***virginica*** records using that model. The use a similar approach for the other species. – user3670179 Dec 26 '20 at 18:54
  • That's why I'm trying to use nest_by to restrict the data by species at the beginning of the script. – user3670179 Dec 26 '20 at 18:55
  • @user3670179 Ok, perhaps now I understood, see Edit 2; `nest_by` wasn't needed, though. – jay.sf Dec 26 '20 at 19:43
  • one advantage of nest_by is that you can do such analysis with multiple variables (e.g. Species and Petal.Width, for instance) instead of just Species. How can I modify your last code for multiple variables? The idea would be to do ss <- iris[iris$Species %in% x & iris$Petal.Length %in% y, ], something like that – user3670179 Dec 26 '20 at 22:41
  • I believe the reason you're asking is that you're attempting to group by continent and then by country. However, since country is a subset of continent (in most cases), you'd just do it along the countries, wouldn't you? Could you be more precise? – jay.sf Dec 27 '20 at 06:55
  • yes, but the gapminder is just an example. In the case I'm actually working on you do need to group by several characteristics, because one single column doesn't make the record unique. In other words, imagine if there were countries called France in several continents. However, I found a solution by combining those columns into a single one, therefore making it a unique key; the solution you suggested works. Thank you – user3670179 Dec 27 '20 at 15:26
  • Could you help me resolve this? Thanks: https://stackoverflow.com/questions/65623154/coefficient-table-of-grouped-linear-regression-using-leave-one-out-approach?noredirect=1#comment116024568_65623154 – user3670179 Jan 08 '21 at 12:32
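
Regarding the last comments about grouping by more than one column, a minimal sketch of the "combine the columns into a single key" idea mentioned above, assuming the gapminder columns continent and country (the column name key is only illustrative):

gap <- as.data.frame(gapminder)
gap$key <- interaction(gap$continent, gap$country, drop=TRUE)  ## e.g. "Asia.Afghanistan"
key.u <- unique(gap$key)
## key.u and gap$key can then take the place of country.u and gap$country in the sketch above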