15

I'd like to specify different types of missing values. My data contain several kinds of missing values, and I am trying to code them as missing in R, but I am looking for a solution where I can still distinguish between them.

Say I have some data that looks like this,

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", "Refused",
                              "Blue", "Red", "Green"), 20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE))
df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C

If I now make two tables my missing values ("Don't know/Not sure","Unknown","Refused" and 77, 88, 99) are included as regular data,

table(df$a,df$g)
#                     C K M Y
# Blue                0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green               2 1 2 0
# Red                 1 0 0 3
# Refused             1 1 2 0
# Unknown             0 0 3 0

and

table(df$b,df$g)
#    C K M Y
# 2  0 0 4 0
# 3  0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "Don't know/Not sure","Unknown","Refused" into <NA>

is.na(df$a) <- df$a %in% c("Don't know/Not sure", "Unknown", "Refused")

and remove the empty levels

df$a <- factor(df$a) 

and the same is done with the numeric values 77, 88, and 99

is.na(df) <- df=="77"|df=="88"|df=="99"

table(df$a, df$g, useNA = "always")       
#       C K M Y <NA>
# Blue  0 0 1 2    0
# Green 2 1 2 0    0
# Red   1 0 0 3    0
# <NA>  1 1 5 1    0

table(df$b,df$g, useNA = "always")
#      C K M Y <NA>
# 2    0 0 4 0    0
# 3    0 2 0 2    0
# <NA> 4 0 4 4    0

Now the missing categories are recoded into NA, but they are all lumped together. Is there a way in R to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure","Unknown","Refused" and 77, 88, 99 as missing, but I still want to have the original information available in the variable.

Eric Fail
  • How about adding another column to the `df` called `isNA` which will hold true if the value is missing? Or the `isNA` column can directly hold `NA` and `0`. It depends on the rest of your code. – Nishanth Apr 18 '13 at 04:35
  • That would probably work, but it's more of a workaround than a solution that would work *seamlessly* with the rest of my code, as you also point out. Would you care to demonstrate it in an example? – Eric Fail Apr 18 '13 at 04:46
  • It is difficult to predict the effect on the rest of the code. Maybe you can write your own `my.table` that uses `my.is.na`, which returns `TRUE` for "Don't know/Not sure","Unknown","Refused" – Nishanth Apr 18 '13 at 05:16
  • It looks like you've provided us with summarized data. Do you have the data in a format that is a step before this one? If so it would just be a matter of factoring. – Brandon Bertelsen Apr 21 '13 at 18:33
  • @BrandonBertelsen, thank you for your question (and your answer). The dummy data I've provided is quite close to how my real data looks. As I mentioned in [my comment to](http://stackoverflow.com/questions/16074384/specify-different-types-of-missing-values#comment23090546_16076252) @Maxim.K I could have been a bit more precise about the variable `a`, but aside from that the data I provided in the question is quite close to how my real data looks. – Eric Fail Apr 22 '13 at 23:37

3 Answers

23

To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does have typed NAs: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA. These differ by storage type, though, not by the reason a value is missing.)
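As a small illustration of the editor's note (a sketch, not part of the original answer): the typed NA constants differ only in their storage type, and `is.na` treats them all alike, so on their own they cannot record why a value is missing.

```r
# The five NA constants in base R, one per atomic storage type.
typed_nas <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Each one has a different storage type ...
na_types <- vapply(typed_nas, typeof, character(1))
print(na_types)

# ... but is.na() reports all of them as missing in the same way.
all_missing <- all(vapply(typed_nas, is.na, logical(1)))
print(all_missing)
```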

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", 
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE), 
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, 
                            replace = TRUE), 
                 f = round(rnorm(n = 10, mean = .90, sd = .08), 
                           digits = 2), 
                 g = sample(c("C", "M", "Y", "K"), 10, 
                            replace = TRUE))
df2 <- df

Let's factor variable "a":

df2$a <- factor(df2$a, 
                levels = c("Blue", "Red", "Green", 
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))

Load the "memisc" library:

library(memisc)

Now, convert variables "a" and "b" to items in "memisc":

df2$a <- as.item(as.character(df2$a), 
                  labels = structure(c(1, 2, 3, 77, 88, 99),
                                     names = c("Blue", "Red", "Green", 
                                               "Don't know/Not sure",
                                               "Refused", "Unknown")),
                  missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b, 
                 labels = c(1, 2, 3, 77, 88, 99), 
                 missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

as.factor(df2$a)
#  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue 
# [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red  
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
#  [1] *Unknown             *Refused             Red                 
#  [4] Red                  Green                Green               
#  [7] Red                  Green                *Unknown            
# [10] *Refused             Blue                 Green               
# [13] Blue                 *Don't know/Not sure *Unknown            
# [16] *Refused             Blue                 Green               
# [19] *Refused             Red                 
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables behaving the way you describe, while retaining all the original information.

table(as.factor(include.missings(df2$a)), df2$g)
#                       
#                        C K M Y
#   Blue                 0 0 1 2
#   Red                  1 0 0 3
#   Green                2 1 2 0
#   *Don't know/Not sure 0 0 0 1
#   *Refused             1 1 2 0
#   *Unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#        
#         C K M Y
#   Blue  0 0 1 2
#   Red   1 0 0 3
#   Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#        
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Red   1 0 0 3    0
#   Green 2 1 2 0    0
#   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

table(as.factor(include.missings(df2$b)), df2$g)
#      
#       C K M Y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#       
#        C K M Y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

> codebook(df2$a)
========================================================================

   df2$a

------------------------------------------------------------------------

   Storage mode: character
   Measurement: nominal
   Missing values: 77, 88, 99

            Values and labels    N    Percent 

    1   'Blue'                   3   25.0 15.0
    2   'Red'                    4   33.3 20.0
    3   'Green'                  5   41.7 25.0
   77 M 'Don't know/Not sure'    1         5.0
   88 M 'Refused'                4        20.0
   99 M 'Unknown'                3        15.0

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.

A5C1D2H2I1M1N2O1R2T1
  • 1
    +1 very good detailed answer! I like the '*' in the rownames when `include.missings` :) – agstudy Apr 21 '13 at 11:21
  • Thank you for a good detailed answer, as @agstudy also points out. – Eric Fail Apr 22 '13 at 23:29
  • +1 really detailed, nice. R does have a way to handle different NA types, but I don't know if you can make use of it. It must do to be able to do `class( c(1,2,NA) )` which is `"numeric"` and `class( c("a","b",NA) )` which is `"character"`? – Simon O'Hanlon Apr 23 '13 at 14:53
  • What other packages let you use different kind of missings simultaneously? I have a dataset with many variables, some numeric, some dates, and I want to code three different kind of missings: errors, unknown and missings generated because of the reshaping of the data. – skan May 01 '17 at 10:24
5

To retain the original values, you can create new columns in which you code the NA information, for example:

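df <- transform(df, b.na = ifelse(b %in% c('77','88','99'), NA, b))
# as.character() keeps the labels if `a` is a factor;
# without it, ifelse() would return the factor's integer codes
df <- transform(df, a.na = ifelse(a %in% 
                        c("Don't know/Not sure","Unknown","Refused"), NA, as.character(a)))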

Then you can do something like this:

   table(df$b.na , df$g)
    C K M Y
  2 0 0 4 0
  3 0 2 0 2

Another option, which avoids creating new columns, is to use the `exclude` argument of `table` to drop the unwanted values from the table (this is different from recoding them as missing values):

table(df$a,df$g,
      exclude=c('77','88','99',"Don't know/Not sure","Unknown","Refused")) 
       C K M Y
  Blue  0 0 1 2
  Green 2 1 2 0
  Red   1 0 0 3

You can define some global constants (even though this is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:

B_MISSING <- c('77','88','99')
A_MISSING <- c("Don't know/Not sure","Unknown","Refused")
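A minimal sketch of how such constants might be used later on, assuming a small hypothetical data frame `toy` (the column names `a_special` and `b_special` are illustrative only). The constants are redefined here so that the snippet runs on its own.

```r
# Constants grouping the special codes, as defined above.
B_MISSING <- c('77', '88', '99')
A_MISSING <- c("Don't know/Not sure", "Unknown", "Refused")

# Hypothetical toy data in the same shape as the question's data.
toy <- data.frame(
  a = c("Blue", "Refused", "Red", "Unknown", "Green"),
  b = c(1, 77, 3, 99, 2),
  g = c("C", "M", "Y", "M", "C"),
  stringsAsFactors = FALSE
)

# Flag the special codes without overwriting the original values.
toy$a_special <- toy$a %in% A_MISSING
toy$b_special <- toy$b %in% B_MISSING

# Tabulate only the substantive answers.
tab <- table(toy$a, toy$g, exclude = A_MISSING)
print(tab)

# Summarise `b` over the valid values only.
valid_mean_b <- mean(toy$b[!toy$b_special])
print(valid_mean_b)
```

Because the original columns are never modified, the individual codes remain available whenever the distinction between them matters.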
agstudy
  • 1
    Thank you for responding to my question. I didn't know about the `exclude` option. That is an interesting solution. I'm still somewhat surprised that R only has one category of missing values. – Eric Fail Apr 18 '13 at 06:48
  • 2
    @EricFail R has one missing value, `NA`, which is basically a logical value, but it also comes in different types: `NA_integer_, NA_real_, NA_complex_ and NA_character_`. You can see my edit for a "global" solution. – agstudy Apr 18 '13 at 06:57
  • 8
    Strictly speaking, these are not (all) missings. "Don't know" is not a missing, it is a valid answer category, and in many cases should be treated as such. "Refused" also contains information, whereas "Unknown" is probably a true missing. I would just create an additional column with these three subcategories and refer to them whenever I needed, while using regular NA for statistical techniques that don't differentiate. – Maxim.K Apr 18 '13 at 07:33
  • @Maxim.K, your comment made me realize that I could have been more precise in my question. The variable `a` in my example should have been more like this `c("Unknown", "Refused", 1, 1, 2, 2, 1, 2, "Unknown", "Refused", 3, 2, 3, "Don't know/Not sure", "Unknown", "Refused", 3, 2, "Refused", 1)` and what I am interested in is storing `a` in a way where I can summarize it, but without losing the distinction between "Don't know/Not sure","Unknown","Refused." Does that make sense? – Eric Fail Apr 22 '13 at 23:19
  • @agstudy, regarding the _global constants_, would this be part of my .Rprofile? – Eric Fail Apr 22 '13 at 23:46
  • @EricFail, should variable "a" be numeric? categorical? factor? – A5C1D2H2I1M1N2O1R2T1 Apr 23 '13 at 06:05
  • @AnandaMahto, in the example in [the initial question](http://stackoverflow.com/questions/16074384/specify-different-types-of-missing-values) `a` is a factor. In [the comment above](http://stackoverflow.com/questions/16074384/specify-different-types-of-missing-values/16076252?noredirect=1#comment23090546_16076252) it's a character variable. It can be anything, if it helps answer the question. – Eric Fail Apr 23 '13 at 06:13
  • @EricFail, in that case, you can try modifying what I've shared as follows: `df2$a <- factor(df2$a, levels = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown"), labels = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown")); df2$a <- as.item(as.character(df2$a), labels = structure(c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown"), names = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown")), missing.values = c("Don't know/Not sure", "Refused", "Unknown"))`. Hope that helps. – A5C1D2H2I1M1N2O1R2T1 Apr 23 '13 at 06:17
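
Maxim.K's suggestion in the comments above (an additional column for the subcategories, with a regular NA in the main variable) could be sketched in base R roughly as follows. This is only an illustration under assumptions: the values come from the revised `a` that Eric Fail describes in his comment, and the names `toy`, `a_raw`, `a_reason` and `special` are hypothetical.

```r
# A shortened version of the revised variable `a` from the comments.
a_raw <- c("Unknown", "Refused", "1", "1", "2", "Don't know/Not sure", "3")
special <- c("Don't know/Not sure", "Unknown", "Refused")

toy <- data.frame(a_raw = a_raw, stringsAsFactors = FALSE)

# Side column: the reason for missingness, NA for substantive answers.
toy$a_reason <- ifelse(toy$a_raw %in% special, toy$a_raw, NA)

# Main column: numeric values with a regular NA for every special code.
toy$a <- as.numeric(ifelse(toy$a_raw %in% special, NA, toy$a_raw))

# Standard tools treat `a` as having ordinary missing values ...
mean_a <- mean(toy$a, na.rm = TRUE)
print(mean_a)

# ... while the reasons can still be summarised separately.
reason_counts <- table(toy$a_reason)
print(reason_counts)
```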
5

If you are willing to stick to numeric values then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values:

x <- c(NA, Inf, -Inf, NaN, 1)
is.finite(x)
## [1] FALSE FALSE FALSE FALSE  TRUE

is.infinite, is.nan and is.na are also useful here.

We could have a special print function that displays them in a more meaningful way, or even create a special class, but even without that the above would divide the data into finite values and several kinds of non-finite values.
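
A rough sketch of the kind of helper hinted at above, assuming an arbitrary mapping from the four non-finite values to reasons. The mapping, the "Not applicable" label and the function name `describe_missing` are hypothetical choices for illustration, not something R defines.

```r
x <- c(NA, Inf, -Inf, NaN, 1, 2)

# Hypothetical mapping from non-finite values to reasons for missingness.
describe_missing <- function(x) {
  out <- rep("valid", length(x))
  out[is.na(x) & !is.nan(x)] <- "Unknown"              # plain NA
  out[is.nan(x)] <- "Don't know/Not sure"              # NaN
  out[is.infinite(x) & x > 0] <- "Refused"             # Inf
  out[is.infinite(x) & x < 0] <- "Not applicable"      # -Inf (made-up label)
  out
}

reasons <- describe_missing(x)
print(table(reasons))

# Note that na.rm = TRUE drops NA and NaN but not Inf or -Inf,
# so summaries should be restricted to the finite values explicitly.
valid_mean <- mean(x[is.finite(x)])
print(valid_mean)
```

The practical cost of this encoding is that every summary has to filter with `is.finite` rather than rely on `na.rm = TRUE`.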

G. Grothendieck