The most succinct and straightforward solution would be a nested for
loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer
, which I also show afterwards.
For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris
data set, which is smaller and thus easier to handle. It contains a "Species"
column that is meant to correspond to the "countries"
in your data.
Part I
First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
For the loop, we want to initialize a matrix im
whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.
im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE),
dimnames=list(NULL, names(FOAE)))
This would look like this:
head(im, n=3)
# fit1 fit2
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
Now we loop over formulas and rows as described above.
for (i in seq(FOAE)) {
for(j in seq(nrow(iris))) {
train <- iris[-j,]
test <- iris[j,]
fit <- lm(FOAE[[i]], data=train)
im[j, i] <- predict(fit, newdata=test)
}
}
im
has now been filled, and we may cbind
it to the original iris
data set to get our result res1
.
res1 <- cbind(iris, im)
head(res1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
To alternatively follow the outer
approach, we put the code inside the for
loop into a formula
which we Vectorize
so that it can handle matrix columns (i.e. vectors).
FUN1 <- Vectorize(function(x, y) {
train <- iris[-x,]
test <- iris[x,]
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
Now we put FOAE
and the rows 1:nrow(iris)
to leave out subsequently, together with FUN1
into outer()
. This already gives us the result that we can cbind
to iris
in the same way as above to get our result res2
.
o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)
head(res2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
## test if results are different is negative
stopifnot(all.equal(res1, res2))
Part II
We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer
method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species"
(in your case "countries"
), which unique
values we put into a vector Species.u
. Since the values are in "character"
or "factor"
format we subset the data using data[!data$cluster %in% x, ]
instead of data[-x, ]
. Because predict
would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean
prediction of each cluster. We use rownames
according to the cluster.
FUN2 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)
o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)
This now gives us a matrix which is smaller than our data set. Thanks to the rownames
we may match
the predictions tho the clusters to which they belong.
o2
# fit1 fit2
# setosa 3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica 4.911009 5.976922
res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])
head(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa 5.1 3.5 1.4 0.2 setosa 3.609943 2.662609
# setosa.1 4.9 3.0 1.4 0.2 setosa 3.609943 2.662609
# setosa.2 4.7 3.2 1.3 0.2 setosa 3.609943 2.662609
# setosa.3 4.6 3.1 1.5 0.2 setosa 3.609943 2.662609
# setosa.4 5.0 3.6 1.4 0.2 setosa 3.609943 2.662609
# setosa.5 5.4 3.9 1.7 0.4 setosa 3.609943 2.662609
tail(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# virginica.44 6.7 3.3 5.7 2.5 virginica 4.911009 5.976922
# virginica.45 6.7 3.0 5.2 2.3 virginica 4.911009 5.976922
# virginica.46 6.3 2.5 5.0 1.9 virginica 4.911009 5.976922
# virginica.47 6.5 3.0 5.2 2.0 virginica 4.911009 5.976922
# virginica.48 6.2 3.4 5.4 2.3 virginica 4.911009 5.976922
# virginica.49 5.9 3.0 5.1 1.8 virginica 4.911009 5.976922
Edit
In this version of FUN2
, FUN3
, the output of the models of each cluster are rbind
ed (in two columns of course, because of two models).
FUN3 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
(predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)
o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)
res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 3.706940 2.678255
# setosa.2 4.9 3.0 1.4 0.2 setosa 3.500562 2.547587
# setosa.3 4.7 3.2 1.3 0.2 setosa 3.294183 2.416919
# setosa.4 4.6 3.1 1.5 0.2 setosa 3.190994 2.351586
# setosa.5 5.0 3.6 1.4 0.2 setosa 3.603751 2.612921
# setosa.6 5.4 3.9 1.7 0.4 setosa 4.016508 3.073249
Edit 2
As I learned in your comment you want 1. a subset of your data along clusters. This would be ss
in FUN4
below. Then the ss
is also subsetted by leaving out one row z
over the rows of subset ss
.
FUN4 <- Vectorize(function(x, y) {
## subsets first by cluster then by row
ss <- iris[iris$Species %in% x,] ## cluster subset
sapply(1:nrow(ss), function(z) { ## subset rows using `sapply`
train <- ss[-z,] ## train data w/o row z
test <- ss[z,] ## test data for `predict`, just row z
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
}, SIMPLIFY=F)
## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
## unique cluster names
Species.u <- unique(iris$Species)
## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)
## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))
## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 1.476004 1.451029
# setosa.2 4.9 3.0 1.4 0.2 setosa 1.449120 1.431737
# setosa.3 4.7 3.2 1.3 0.2 setosa 1.426185 1.416492
# setosa.4 4.6 3.1 1.5 0.2 setosa 1.404040 1.398103
# setosa.5 5.0 3.6 1.4 0.2 setosa 1.462460 1.441295
# setosa.6 5.4 3.9 1.7 0.4 setosa 1.504990 1.559045