I am using the mice package in R to multiply impute a dataset with a large amount of missingness. Some variables in the raw dataset are important for the imputation process and for later analyses, but I want to create a correlation matrix with cor() that leaves some of them out. For an ordinary data frame x, cor(x[,3:7]) yields the correlation matrix for columns 3 through 7. If x is a mids object created by mice(), one would normally use with() to repeat an analysis on each imputed dataset, producing a mira object, and then use pool() to create a mipo (pooled results) object. However, the second argument of with() is supposed to be an expression, typically a model formula, that references the columns of the dataset, and that is not the kind of input cor() takes. If x is a mids object, cor(x[,3:7]) does not work, and neither does with(x, cor(x[,3:7])).

How can I create a pooled correlation matrix for a subset of the variables in a multiply imputed dataset?

# reproducible example
library(mice)
x = data.frame(matrix(rnorm(100), 10, 10))  # create random data
x[9:10, ] = NA                              # add missingness
x.mice = mice(x)                            # make imputed data set
cor(x.mice[, 3:7])                          # doesn't work
with(x.mice, cor(x.mice[, 3:7]))            # doesn't work
with(x.mice[, 3:7], cor())                  # doesn't work
Paul de Barros
  • Please consider adding a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to your question: it will make it much easier for us to help you. – Vincent Guillemot Mar 25 '16 at 14:09
  • Good point, Vincent Guillemot. I added one to the question. Thanks. – Paul de Barros Mar 25 '16 at 14:19
  • If you name the columns in your data, it's probably easier to work with. So use `x = setNames(x, letters[1:10])` at the relevant place (before calling mice), and then this should work: `with(x.mice, cor(cbind(a,b)))` – user20650 Mar 25 '16 at 14:40
  • But as another thought, as there is no `pool` method for this (unless you use `lm`, I guess), it may be easier just to loop through the `complete` datasets: `lapply(1:5, function(ii) cor(complete(x.mice, ii)[3:7]))` – user20650 Mar 25 '16 at 14:53
  • To pool the results, I would just take the *mean* matrix, which is coincidentally also a correlation matrix. If all the matrices are in a list `L`, do `Reduce("+",L)/5` to obtain the pooled correlation matrix. – Vincent Guillemot Mar 25 '16 at 15:44
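
The two suggestions in the comments above (run `cor()` on each completed dataset, then average the matrices) can be combined into one small helper. The following is only a minimal sketch, not part of mice: `pool_cor` and its `fisher` argument are hypothetical names introduced here, and the simulated `completed` list merely stands in for the datasets that `complete()` would return. The optional Fisher z averaging is an assumption on my part, a common alternative to the plain mean that was not proposed in the thread.

```r
# Pool correlation matrices across completed (imputed) datasets.
# `datasets`: list of data frames without missing values.
# `cols`:     columns to correlate (indices or names).
# `fisher`:   if TRUE, average on Fisher's z scale instead of the raw scale.
pool_cor <- function(datasets, cols, fisher = FALSE) {
  mats <- lapply(datasets, function(d) cor(d[, cols]))
  if (!fisher) {
    # Plain element-wise mean of the correlation matrices
    return(Reduce(`+`, mats) / length(mats))
  }
  # Zero the diagonal first, because atanh(1) is infinite
  z <- lapply(mats, function(m) { diag(m) <- 0; atanh(m) })
  pooled <- tanh(Reduce(`+`, z) / length(z))
  diag(pooled) <- 1
  pooled
}

# Simulated stand-in for five completed datasets (assumption: with mice
# these would come from complete(), see the note below).
set.seed(1)
completed <- lapply(1:5, function(i) data.frame(matrix(rnorm(100), 10, 10)))

pooled   <- pool_cor(completed, 3:7)                 # mean of the matrices
pooled_z <- pool_cor(completed, 3:7, fisher = TRUE)  # Fisher z average
```

With a real mids object such as `x.mice`, the list could be built with `lapply(seq_len(x.mice$m), function(i) complete(x.mice, i))`. Note that this yields pooled point estimates only; it gives no standard errors or p-values.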

1 Answer

I've had the same problem. The newly added package "miceadds" adds very useful functionality to the mice package.

Specifically, for your problem, look up the function micombine.cor, which performs inference for correlations and covariances on multiply imputed datasets.

For example:

library(missForest)
library(mice)
library(miceadds)

# Get the data
data <- iris

# Introduce missing values (prodNA comes from missForest)
iris.mis <- prodNA(data, noNA = 0.1)

# Impute the data
imputed <- mice(iris.mis, m = 5, maxit = 5, method = "pmm")

# Correlations for the first three variables (package miceadds);
# mi.res must be the imputed mids object, not the raw data with NAs
correlations <- miceadds::micombine.cor(mi.res = imputed, variables = c(1:3))

# And because I am a psychologist and don't like scientific notation...
correlations$p_value <- format(correlations$p, scientific = FALSE)
correlations
George GL