
I am using the mice package to impute missing data. I'm new to imputation, so I have made some progress but have run into a steep learning curve. To give a toy example:

library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)

As you can see, I imputed the nhanes data 10 times using mostly default settings (the result is stored in df1), and I am comfortable using this result in regression models, pooling results, etc. However, my real-life data are survey data from different countries. The level of missingness differs by country, as do the values of specific variables (age, education level, etc.). Therefore I would like to impute the missing values while allowing for clustering by country. So I will create a grouping variable that has no missing values (of course, in this toy example the correlations with other variables are absent, but in my real data they exist):

# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)

So how do I tell mice() that this variable is different from the others, i.e. that it is a level in a multi-level dataset?

user2498193
  • Would running `mice` on each factor level be a good workaround? For example, `mice(nhanes[which(nhanes$country == 'A'),], m=10)`, and then loop over the factor levels or use your favorite group-by operation in R? This of course assumes that to impute data for country `A` one doesn't need the other countries, i.e. they're independent. – Gene Burinsky Jun 29 '16 at 14:27
  • Well yes, I did try this, and there is a function to combine the datasets, `rbind.mids()`, but I've found this function gives me lots of warnings and errors that I could not figure out. Ultimately I thought imputing with recognition of the data structure would be better. Thanks for the suggestion – user2498193 Jun 29 '16 at 14:32

2 Answers


If you're thinking of clusters as in "mixed-effects" models, then you should use the methods that mice provides for clustered data. These methods can be found in the manual, and their names usually start with 2l, as in 2l.something.

The variety of methods for clustered data is somewhat limited in mice, but I can recommend using 2l.pan for missing data in lower-level units and 2l.only.norm at the cluster level.

As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.

Below, I show an example for both strategies.

Preparation:

library(mice)
data(nhanes)

set.seed(123)
nhanes <- within(nhanes,{
  country <- factor(sample(LETTERS[1:10], size=nrow(nhanes), replace=TRUE))
  countryID <- as.numeric(country)
})

Case 1: Imputation using mixed-effects models

This section uses 2l.pan to impute the three variables with missing data. Note that I use countryID as the cluster variable by specifying a -2 in the predictor matrix. To all other variables, I assign fixed effects only (1).

# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred1 <- imp0$predictorMatrix
meth1 <- imp0$method

# set imputation procedures
meth1[c("bmi","hyp","chl")] <- "2l.pan"

# set predictor Matrix (mixed-effects models with random intercept
# for countryID and fixed effects otherwise)
pred1[,"country"] <- 0     # don't use country factor
pred1[,"countryID"] <- -2  # use countryID as cluster variable
pred1["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred1["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred1["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp1 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred1, method=meth1)
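Before imputing, it can help to print the rows of the predictor matrix for the variables being imputed and confirm the codes are what you intended. The following is only a minimal sketch on the toy data: `dat`, `targets`, `pred` and `pred_slope` are names made up for this illustration, and the random-slope variant (code 2) is shown purely to illustrate the coding, not as a recommendation for data this small.

```r
library(mice)
data(nhanes)

set.seed(123)
dat <- nhanes
# toy integer cluster identifier (assumption: stands in for a real country ID)
dat$countryID <- sample(1:10, size = nrow(dat), replace = TRUE)

# template predictor matrix from an "empty" imputation
pred <- mice(dat, maxit = 0)$predictorMatrix
targets <- c("bmi", "hyp", "chl")

# each row describes how ONE target variable is imputed:
# -2 = cluster variable, 1 = fixed effect, 2 = random effect, 0 = not used
pred[targets, "countryID"] <- -2

# illustration only: give age a random slope when imputing bmi
pred_slope <- pred
pred_slope["bmi", "age"] <- 2

print(pred[targets, ])
print(pred_slope["bmi", ])
```

Reading the matrix row by row (one row per imputed variable) is usually the quickest way to spot a misplaced code.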

Case 2: Imputation using dummy indicators (DIs) for clusters

This section uses pmm for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.

# create dummy indicator variables
DIs <- with(nhanes, contrasts(country)[country,])
colnames(DIs) <- paste0("country",colnames(DIs))
nhanes <- cbind(nhanes,DIs)


# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred2 <- imp0$predictorMatrix
meth2 <- imp0$method

# set imputation procedures
meth2[c("bmi","hyp","chl")] <- "pmm"

# set predictor matrix (clusters enter as fixed effects via dummy indicators)
pred2[,"country"] <- 0     # don't use country factor
pred2[,"countryID"] <- 0   # don't use countryID
pred2[,colnames(DIs)] <- 1 # use dummy indicators
pred2["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred2["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred2["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp2 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred2, method=meth2)
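Once an imputation has run, the analysis step is the usual mice workflow mentioned in the question: fit the model on each imputed dataset and pool the results. The sketch below is self-contained, so it uses a plain default imputation of nhanes (stored under the made-up name `imp`) as a stand-in for `imp1` or `imp2`, and an ordinary `lm()` as a stand-in for whatever analysis model you actually need. For a mixed-effects analysis you would presumably swap in a mixed-model fitting function instead.

```r
library(mice)
data(nhanes)

# stand-in imputation (assumption: replace with imp1 or imp2 from above)
imp <- mice(nhanes, m = 10, maxit = 20, seed = 123, printFlag = FALSE)

# fit the same model on each of the 10 completed datasets
fit <- with(imp, lm(bmi ~ age + chl))

# combine the estimates with Rubin's rules
pooled <- pool(fit)
print(summary(pooled))
```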

If you want to read up on what to think of these methods, have a look at one or two of these papers.

SimonG
  • Hi SimonG. Great answer, thanks, it helps a lot! It helps me understand much better than the package reference does. I'm left with one question, however. In case 1, why did you use -2 to indicate the cluster variable instead of +2? What is the difference between -2 and +2? This is not explained in the help file that I can see. – user2498193 Jun 29 '16 at 23:03
  • It is explained in the [manual](https://cran.r-project.org/web/packages/mice/mice.pdf) (page 47), where they explain the `type` argument of `2l.pan`. The `type` argument describes how `mice` understands the rows in the predictor matrix (`pred1`). In all two-level functions, the variable denoted by `-2` is interpreted as the cluster variable. Those with a `1` are understood as predictors with fixed effects, those with a `2` as predictors with random effects. The codes `3` and `4` work similarly to `1` and `2`, with the difference that the cluster mean is calculated and included as an additional predictor. – SimonG Jun 30 '16 at 08:45
  • Ahh brilliant thanks I missed that. Much appreciate your help! – user2498193 Jun 30 '16 at 08:49

You have to set up a predictorMatrix to tell mice which variables to use to impute another. A fast way of doing so is predictorM <- quickpred(nhanes).

Then you change the 1s in the matrix to 2 for an ordinary variable and to -2 for the level-two variable that identifies the countries, and pass the matrix to mice() as predictorMatrix = predictorM. In the method argument you then have to set the method to 2l.norm for a metric variable or 2l.binom for a binary variable. For the latter you need the function written by Sabine Zinn (https://www.neps-data.de/Portals/0/Working%20Papers/WP_XXXI.pdf). Unfortunately, I do not know whether there are methods for imputing two-level count data.
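Applied to the toy data, the steps described here might look roughly like the sketch below. Two things differ from the text and are assumptions of this example: it uses 2l.pan rather than 2l.norm (following the comment under this answer), and it leaves ordinary predictors at 1 (fixed effects) instead of 2, because the toy data are tiny. The names `dat`, `incomplete` and `meth` are made up for the illustration, and the pan package needs to be installed for 2l.pan to run.

```r
library(mice)
data(nhanes)

set.seed(123)
dat <- nhanes
# toy integer cluster identifier standing in for the country variable
dat$countryID <- sample(1:10, size = nrow(dat), replace = TRUE)

# quick starting point for the predictor matrix
predictorM <- quickpred(dat)

# mark the cluster variable in the rows of the variables to be imputed
incomplete <- c("bmi", "hyp", "chl")
predictorM[incomplete, "countryID"] <- -2
# entries left at 1 are fixed effects; change one to 2 for a random slope

# default methods as a template, then switch to a two-level method
meth <- mice(dat, maxit = 0)$method
meth[incomplete] <- "2l.pan"

imp <- mice(dat, m = 5, maxit = 5, predictorMatrix = predictorM,
            method = meth, printFlag = FALSE)
```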

Be aware that imputing a multilevel dataset will slow down the process a lot. In my experience, resampling methods such as PMM, or those in the BaBooN package, work well in keeping the hierarchical structure of the data and are much faster to use.

helper
  • Thanks for your answer. I don't quite understand the part about changing to 2 and -2: which cells do I change (this is also not clear from the function help file)? As for the methods `2l.norm` and `2l.binom`, are these something I need to set for each variable in the data.frame that I'm imputing? Re: PMM, isn't that part of the `mice` command? – user2498193 Jun 29 '16 at 14:39
  • Please, don't use `2l.norm`. The function is still buggy, and no one has been around to fix it so far. For continuous variables in clustered data, the `2l.pan` method should be used. – SimonG Jun 29 '16 at 20:32
  • The OP provided an example dataset. This answer would be much improved by a coded representation of your solution specific to the example data, instead of just walking through it in "pseudo-code-speak". – alexwhitworth Jun 29 '16 at 23:45