I have a description of a dataset stating the following:

  • Not applicable (NA) is coded as 88888
  • Missing data is coded as blank or 99999

I thought NA was equivalent to missing data, which would make the two codes above equivalent. Or does "not applicable" stand for impossible values (e.g., dividing by zero)? If they can be treated as equivalent, I could simply convert 88888, 99999 and blanks to NA. Thanks for your help.

Patrick Balada
  • 1
    Dear downvoter. Downvoting etiquette suggests you leave a constructive comment with your downvote. Otherwise the OP has nothing but a downvote to learn from. – Brandon Bertelsen Apr 22 '17 at 21:13
  • 2
    It's hard to tell without knowing specifics of your dataset. If the description tells you the NA and missing are coded differently, there is likely a reason for it. It's probably not the best idea to just code them the same. – Mike H. Apr 22 '17 at 21:25
  • @Mike H. I totally agree, and that's why I am suspicious. However, this is the only information provided in the data set description. But how could I distinguish them? The truly missing values would be missing at random, but the "not applicable" ones might not be missing at random. – Patrick Balada Apr 22 '17 at 21:31
  • 1
    You're asking a question about data cleaning without posting any data, so it's hard to say. For example, if the column is numeric vs. character, there might be different approaches. A reproducible example with some data and a specific question would be helpful: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Mike H. Apr 22 '17 at 21:46

2 Answers

Yes, NA is the equivalent of an "empty cell".

You can do this to reduce the file size

data$column_with_8888s <- ifelse(data$column_with_8888s == 88888, NA, data$column_with_8888s)
pachadotdev

Missing data is represented as NA in R. If you need a slightly different definition of missing in this context, you will have to set or calculate your own bases for each question when looking at summary data.

Sometimes it can be useful to get counts of these BEFORE converting them.

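# Count each code per column before recoding (na.rm guards against existing NAs)
not_applicables <- lapply(mydata, function(x) sum(x == "88888", na.rm = TRUE))
missings <- lapply(mydata, function(x) sum(x == "99999", na.rm = TRUE))

# Recode both codes to NA in every column
mydata <- as.data.frame(lapply(mydata, function(x) {
     x[x %in% c('88888','99999')] <- NA
     x
}))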

"Not applicable" is likely a category that you may not wish to exclude: once both codes have been recoded to NA, summary() will not distinguish between them, even if you convert the column to a factor.
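One way to keep the two codes apart, as a rough sketch: recode only 99999 and blanks to NA, and turn 88888 into an explicit factor level. The data frame `survey`, the column `q1` and the label "Not applicable" are made up for illustration and would need adapting to the real data.

```r
# Hypothetical example data: one survey question read in as character
survey <- data.frame(q1 = c("1", "2", "88888", "99999", "", "2"),
                     stringsAsFactors = FALSE)

# Only the true missing codes (99999 and blank) become NA
survey$q1[survey$q1 %in% c("99999", "")] <- NA

# Keep "not applicable" as its own labelled category
survey$q1[survey$q1 %in% "88888"] <- "Not applicable"
survey$q1 <- factor(survey$q1)

# useNA = "ifany" lists the NA count next to the regular levels
counts <- table(survey$q1, useNA = "ifany")
print(counts)
```

With this layout, "Not applicable" is counted as a regular level while the truly missing values stay NA.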

Alternatively, you may be able to fix this right when you load your data:

read.csv(..., na.strings = c("88888","99999"))
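A small sketch of how this could look, using the `text` argument of read.csv in place of a real file. The CSV content and the column names `id`, `q1` and `q2` are invented. Adding "" to na.strings is an assumption about how you want blanks handled: it makes blank cells in character columns NA as well (blank fields in numeric columns are read as NA anyway).

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "id,q1,q2
1,3,a
2,88888,b
3,99999,
4,,d"

mydata <- read.csv(text = csv_text,
                   na.strings = c("88888", "99999", ""),
                   stringsAsFactors = FALSE)

# Both codes and the blanks arrive as NA
print(mydata)
print(colSums(is.na(mydata)))
```

Note that this treats "not applicable" and missing identically from the start, so the original distinction cannot be recovered afterwards.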
Brandon Bertelsen