Coding Missing Data in R

Question

I have a dataset in which inspection shows some of the following values, all of which should be treated as missing:

'missing'
'unknown'
'uncoded'

Am I correct in thinking that I can just replace all occurrences of these with NA? Is this the preferred way of doing it?

var[var=='missing'] <- NA
var[var=='unknown'] <- NA
var[var=='uncoded'] <- NA

The "preferred" way most likely depends on the analysis you're trying to do, and so you might get a lot of opinion answers. Perhaps rephrasing the question to something like "How can R represent missing data" might help. However, the answer to that question is probably already out there. — BenBarnes, Jul 04 '12 at 09:09
Thanks, but I thought that missing data was always represented as NA ? I've always just dealt with data that was already coded as NA, but this time I have these other codings. — Joe King, Jul 04 '12 at 09:17
Ah, I didn't know whether you were referring to the preferred way of recoding or the preferred way of dealing with missing values. If you're sure you want to code them as `NA`, then you could also consider `NA_character_`, `NA_integer_` etc (listed under `?"NA"`) — BenBarnes, Jul 04 '12 at 09:23
BenBarnes , sorry. To clarify, I am referring to the way of coding missing data. Actually I am going to use the `mice` package to impute these missing values. I am teaching myself R and don't have very much experience yet. — Joe King, Jul 04 '12 at 09:29
Of interest: http://stackoverflow.com/questions/5335745/how-do-i-handle-multiple-kinds-of-missingness-in-r — Ari B. Friedman, Jul 04 '12 at 10:23

score 6 · Accepted Answer · answered Jul 04 '12 at 11:00

What you show is feasible, but you can simplify your code to a single call doing the comparison via the %in% binary operator. Here is an example using some dummy data:

set.seed(1)
var <- factor(sample(c("missing","unknown","uncoded", 1:4), 100, replace = TRUE))

This gives us a factor vector like this:

> head(var)
[1] unknown uncoded 2       4       unknown 4      
Levels: 1 2 3 4 missing uncoded unknown
> table(var)
var
      1       2       3       4 missing uncoded unknown 
     14      15      17      13      10      18      13

To set all those values coded as any of c("missing","unknown","uncoded") to NA, we do it in a single shot:

var2 <- var ## copy for demo purposes, but you can overwrite var if you wish
var2[var2 %in% c("missing","unknown","uncoded")] <- NA

which gives

> var2[var2 %in% c("missing","unknown","uncoded")] <- NA
> head(var2)
[1] <NA> <NA> 2    4    <NA> 4   
Levels: 1 2 3 4 missing uncoded unknown
> table(var2)
var2
      1       2       3       4 missing uncoded unknown 
     14      15      17      13       0       0       0

Notice how the original levels are preserved. If you want to remove those levels then we can apply the droplevels() function to var2:

var2 <- droplevels(var2)

which gives

> head(var2)
[1] <NA> <NA> 2    4    <NA> 4   
Levels: 1 2 3 4
> table(var2)
var2
 1  2  3  4 
14 15 17 13
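
If the remaining levels are meant to be numbers (an assumption; the question does not say), a further step is needed, because as.numeric() applied to a factor returns the internal level codes rather than the labels. A minimal sketch, using a small hypothetical vector x rather than the var2 above:

```r
# Hypothetical toy factor; 'x' and 'num' are illustrative names only.
x <- factor(c("unknown", "2", "4", "uncoded", "1"))
x[x %in% c("missing", "unknown", "uncoded")] <- NA
x <- droplevels(x)   # remaining levels: "1" "2" "4"

# Converting the factor directly yields the integer level codes,
# so the label "4" comes back as 3 (it is the third level).
codes <- as.numeric(x)

# Going through character first recovers the labels as numbers.
num <- as.numeric(as.character(x))
num
```

Here num keeps the NA entries in place, which is the form imputation tools generally expect.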

Also note that by default table() does not show the NA values; passing useNA = "ifany" shows that they are still there:

> table(var2, useNA = "ifany")
var2
   1    2    3    4 <NA> 
  14   15   17   13   41
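
Since the question concerns a whole dataset rather than a single vector, the same %in% idea can be applied column by column. The following is only a sketch: the data frame dat, its column names, and the helper recode_missing() are hypothetical, and it assumes the three codes mean "missing" in every factor or character column.

```r
missing_codes <- c("missing", "unknown", "uncoded")

# Hypothetical toy data; column names are made up for illustration.
dat <- data.frame(
  grade  = factor(c("1", "missing", "3", "unknown")),
  region = c("north", "uncoded", "south", "north"),
  age    = c(34, 51, 28, 45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode the codes to NA in factor/character columns
# and leave every other column type untouched.
recode_missing <- function(col, codes = missing_codes) {
  if (is.factor(col) || is.character(col)) {
    col[col %in% codes] <- NA
    if (is.factor(col)) col <- droplevels(col)  # drop the now-empty levels
  }
  col
}

# dat[] keeps the data.frame structure while replacing each column.
dat[] <- lapply(dat, recode_missing)

# Count of NA values per column.
colSums(is.na(dat))
```

Restricting the helper to factor and character columns avoids touching numeric columns, where these strings cannot occur anyway.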

score 4 · Answer 2 · answered Jul 04 '12 at 10:56

The general idea of replacing them with NA is correct.

You can use recode if you want to do it in a single line:

library(car)
var <- recode( var, "c('missing','unknown','uncoded')=NA" )
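
If the data are read from a text file, the recoding can also happen at import time through the na.strings argument of read.csv(). This is a sketch under that assumption; the inline CSV text and the names csv_text and dat are hypothetical stand-ins for a real file.

```r
# Hypothetical inline CSV standing in for a file on disk.
csv_text <- "id,status\n1,3\n2,missing\n3,unknown\n4,1\n5,uncoded\n"

# Every string listed in na.strings is read as NA; "NA" is kept in the
# list so that the default behaviour is not lost.
dat <- read.csv(text = csv_text,
                na.strings = c("NA", "missing", "unknown", "uncoded"))

str(dat)
```

Because the codes are turned into NA before type conversion, the status column is read as a number rather than as text.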

Ari B. Friedman
