Imputing values in all columns of data.frame with mice

Question

I am trying to impute values using a linear model with mice. My understanding of mice is that it iterates over the columns: for each column with NAs it uses all other columns as predictors, fits the model, and then samples from this model to fill in the NAs. Here is an example where I generate some data and then introduce missing data using ampute.

    library(mice)

    n <- 100
    xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n) * 2 + rnorm(n, 0, 1))
    head(xx)
    res <- ampute(xx)
    head(res$amp)

The missing data looks like:

            x         y
   1       NA  3.887147
   2 2.157168        NA
   3 2.965164  6.639856
   4 3.848165  8.720441
   5       NA 11.167439
   6       NA 12.835415

Then I am trying to impute the missing data:

    mic <- mice(res$amp, diagnostics = FALSE)

I would expect no NAs to remain after imputation, but one of the columns always still contains NAs:

    colSums(is.na(complete(mic, 1)))

Which of the two columns it is appears to be random.

Running the code above, I get:

    > colSums(is.na(complete(mic,1)))
     x  y 
     0 30

but also:

    > colSums(is.na(complete(mic,1)))
     x  y 
    33  0

I am unsure what is exactly your question. What would you like to do? — user3507584, Jul 13 '17 at 08:46

score 1 · Accepted Answer · answered Jul 13 '17 at 21:38

I tried to run your code and ended up with the same type of problem:

    library(mice)
    n <- 100
    xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n) * 2 + rnorm(n, 0, 1))
    head(xx)
    res <- ampute(xx)
    head(res$amp)

If you look at the summary of the mice call, you get an indication that something is wrong. With my data I get:

    tempData <- mice(res$amp, m = 5, maxit = 50, seed = 500)
    summary(tempData)
    Multiply imputed data set
    Call:
    mice(data = res$amp, m = 5, maxit = 50, seed = 500)
    Number of multiple imputations:  5
    Missing cells per column:
     x  y 
    21 23 
    Imputation methods:
        x     y 
    "pmm" "pmm" 
    VisitSequence:
    x 
    1 
    PredictorMatrix:
       x  y
    x  0  0
    y  0  0
    Random generator seed value:  500

There are two indicators here. One is VisitSequence, which shows that only the first column, x, is visited and not column y. The other is the PredictorMatrix, which contains only zeros, so neither variable uses information from the other as a predictor.

The problem is in your simulated data: the two columns are too collinear (a similar solution is given in this detailed answer). Because the y column is essentially twice the value of the x column, it is silently discarded from the analysis.
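One quick way to see how collinear the simulated columns are is to compare their correlations. This is only a base-R sketch: the seed and the names `linear` and `quadratic` are my own choices, and it does not reproduce the exact check that mice runs internally.

```r
# Compare how strongly the two simulated designs are correlated.
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100

# Original simulation: y is roughly 2 * x
linear <- data.frame(x = 1:n + rnorm(n, 0, 0.1),
                     y = (1:n) * 2 + rnorm(n, 0, 1))

# Alternative simulation: y is roughly x squared
quadratic <- data.frame(x = 1:n + rnorm(n, 0, 0.1),
                        y = (1:n)^2 + rnorm(n, 0, 1))

cor_linear <- cor(linear$x, linear$y)
cor_quadratic <- cor(quadratic$x, quadratic$y)

print(c(linear = cor_linear, quadratic = cor_quadratic))
```

The linear design gives a correlation very close to 1, while the quadratic one is noticeably lower (around 0.97), which leaves each column with information the other does not fully duplicate.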

Try simulating data that are not almost perfectly linear and it will work, for example with a quadratic relationship:

    n <- 100
    xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n)^2 + rnorm(n, 0, 1))
    head(xx)
    res <- ampute(xx)
    head(res$amp)
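To confirm that the imputation now fills both columns, something like the following could be run on the quadratic data. It is a sketch that assumes the mice package is installed; the seeds are arbitrary, and `printFlag = FALSE` is only there to silence the iteration log.

```r
library(mice)

set.seed(123)  # arbitrary seed so the simulation is reproducible
n <- 100
xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1),
                 y = (1:n)^2 + rnorm(n, 0, 1))
res <- ampute(xx)

mic <- mice(res$amp, printFlag = FALSE, seed = 500)

# Each variable should now be listed as a predictor of the other
print(mic$predictorMatrix)

# Variables dropped during mice's checks, if any, are recorded here
# (NULL when nothing was logged)
print(mic$loggedEvents)

# Count the NAs left in the first completed data set
na_left <- colSums(is.na(complete(mic, 1)))
print(na_left)
```

If either column still shows NAs here, the predictor matrix and the logged events are the first places to look.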
