I have a dataset with many missing values, and I want to split the rows by missing-data pattern. The `mice` package is handy for summarizing the patterns, but when I select the rows that have a particular pattern, I get far fewer rows than the count in the pattern matrix suggests.

My code is as follows.

To get the missing pattern:

library(mice)
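# md.pattern() returns a matrix whose row names are the pattern frequencies.
# Convert it to a data frame with the frequency stored in a `freq` column.
pattern = md.pattern(data)
# row names are character, so convert to numeric before sorting
freq = as.numeric(dimnames(pattern)[[1]][-nrow(pattern)])
# drop the last row (column totals) and the last column (per-pattern count of missing variables)
pattern = data.frame(pattern[seq_len(nrow(pattern) - 1), seq_len(ncol(pattern) - 1)], row.names = NULL)
pattern$freq = freq
pattern = pattern[order(pattern$freq, decreasing = TRUE),]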

However, when I count the rows that match one specific pattern manually, the count is much smaller.

count = 0
for (i in 1:nrow(data)){
    # match the missingness by the entire row
    if (all(!is.na(data[i, names(data)[seq_len(ncol(pattern) - 1)]]) == test[1, seq_len(ncol(pattern) - 1)])){
        count = count + 1
    }
}

Does anyone have an idea what is going wrong? Thanks!

The data has many variables (107 in total) and more than 70,000 observations. The code works on the sample data `nhanes` from the mice package, but it gives the wrong count on my data file.

For example:

V1 V2 V3 V4 V5
1  NA  3  5  2
NA  3 23  2  9
NA  3 90  7  5
3   3  2 34 NA
3  NA  2  1  3
4  NA  7  3  1
StatCC
  • You have to provide some sample data for us to play with that is representative of your real data. At the moment, we have nothing to base any suggestions on. – thelatemail Oct 23 '15 at 04:50
  • @thelatemail I have uploaded a sample of the data file. Thanks! – StatCC Oct 23 '15 at 06:02
  • *"Provide some sample data"* does not mean *"give us a link to a file of unknown origin so that we can click on it, see what format it is in, infer where the problems are, etc"*. Please reduce the problem to a small dataset and add that directly to this question. (This also helps somebody who might benefit from this question when the link you provide goes stale.) – r2evans Oct 23 '15 at 06:33
  • ... and by *"add to this question"*, I suggest something like `dput(myvar)` or the code used to actually create the data (e.g., a call to `data.frame`). – r2evans Oct 23 '15 at 06:35
  • Please consider reading this: [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in r. – Heroka Oct 23 '15 at 06:43

1 Answer

I checked the source code for `md.pattern` in the mice package. It is based on Schafer's `prelim.norm` function rather than on checking the missing-value pattern row by row.
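One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the mice documentation describes the rows and columns of the `md.pattern` result as sorted by increasing amounts of missing information, so its columns need not be in the same order as `names(data)`. Matching columns by name instead of by position sidesteps that. Below is a minimal base-R sketch using the toy data from the question; `df`, `target`, and `count_pattern` are hypothetical names made up for illustration.

```r
# Toy data from the question's example
df <- data.frame(
  V1 = c(1, NA, NA, 3, 3, 4),
  V2 = c(NA, 3, 3, 3, NA, NA),
  V3 = c(3, 23, 90, 2, 2, 7),
  V4 = c(5, 2, 7, 34, 1, 3),
  V5 = c(2, 9, 5, NA, 3, 1)
)

# Hypothetical helper: count the rows whose observed/missing pattern equals
# `target`, a named vector coded like md.pattern (1 = observed, 0 = missing).
count_pattern <- function(data, target) {
  # logical matrix of "observed" flags, with columns aligned to `target` by name
  observed <- !is.na(data[, names(target), drop = FALSE])
  matches <- apply(observed, 1, function(row) all(row == (target == 1)))
  sum(matches)
}

# The pattern is deliberately given in a different column order than `df`
target <- c(V3 = 1, V4 = 1, V5 = 1, V1 = 1, V2 = 0)
count_pattern(df, target)  # 3 rows have only V2 missing
```

Because the columns are selected with `names(target)`, the result does not depend on how the pattern matrix happens to order its columns.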

The `count` function in the plyr package does the trick. I wrote the function below to return the top n missing patterns in a dataset, where `x` is the data frame and `topn` is the number of patterns to return. It works well in my case.

library(plyr)
miss.pattern <- function(x, topn) {
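  # indicator data frame of missingness: 1 = missing, 0 = observed
  r <- 1 * data.frame(is.na(x))
  # plyr::count() tallies identical rows and adds a `freq` column
  pattern <- data.frame(count(r))
  # most frequent patterns first
  pattern <- pattern[order(-pattern$freq),]
  # head() avoids NA rows when topn exceeds the number of patterns
  return(head(pattern, topn))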
}
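To get from the pattern table back to the rows themselves, which was the original goal of separating the data by missing pattern, one option is to build a pattern key for each row and split on it. This is only a sketch in base R, without plyr; `pattern_key`, `keys`, and `groups` are hypothetical names, and the toy data is the example from the question.

```r
# Hypothetical helper: one string per row, "1" = missing, "0" = observed
# (the same coding as miss.pattern above).
pattern_key <- function(x) {
  apply(1 * is.na(x), 1, paste, collapse = "")
}

# Toy data from the question's example
df <- data.frame(
  V1 = c(1, NA, NA, 3, 3, 4),
  V2 = c(NA, 3, 3, 3, NA, NA),
  V3 = c(3, 23, 90, 2, 2, 7),
  V4 = c(5, 2, 7, 34, 1, 3),
  V5 = c(2, 9, 5, NA, 3, 1)
)

keys <- pattern_key(df)

# Pattern frequencies, most common first
sort(table(keys), decreasing = TRUE)

# List of data frames, one per missing pattern
groups <- split(df, keys)
groups[["01000"]]  # the rows where only V2 is missing
```

Since the key is computed from each row directly, the size of every group agrees with the frequency table by construction.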
StatCC