
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

  • 0-99: Data
  • -1: Question not asked
  • -5: Do not know
  • -7: Refused to respond
  • -9: Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness: it allows you to assign a generic `.` to missing data, but more specific kinds of missingness (`.a`, `.b`, `.c`, ..., `.z`) are allowed as well. All the commands that look at missingness treat every missing entry as missing, however it is specified, but you can still sort out the various kinds of missingness later on. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does a question not being asked.

I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.

– Ari B. Friedman

6 Answers


I know what you're looking for, and it is not implemented in R. I don't know of a package that implements it either, but it's not too difficult to code yourself.

A workable way is to add a data frame containing the codes to the attributes. To avoid doubling the whole data frame and to save space, I'd store only the indices in that attribute data frame instead of reconstructing a complete copy.

e.g.:

NACode <- function(x, code){
    # Replace every coded value by NA
    Df <- sapply(x, function(i){
        i[i %in% code] <- NA
        i
    })

    # Record where the codes occurred, and which code each one was
    # (1-based row/column arithmetic so the last row of a column
    # doesn't map to row 0 of the next one)
    id <- which(is.na(Df))
    rowid <- (id - 1) %% nrow(x) + 1
    colid <- (id - 1) %/% nrow(x) + 1
    NAdf <- data.frame(
        id, rowid, colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df, "NAcode") <- NAdf
    Df
}

This allows you to do:

> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA NA NA 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute giving the label for the different values; see also this question. You can transform back with:

ChangeNAToCode <- function(x, code){
    NAval <- attr(x, "NAcode")
    # Restore the original coded values, but only for the requested codes
    for(i in which(NAval$value %in% code))
        x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]

    x
}

> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA -2 -3 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

This lets you restore only the codes you want, should that ever be necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.

In short: using attributes and indices might be a nice way of doing it.
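As a sketch of such an extraction helper (the name `WhichNACode` is my own illustration, not part of any package; it assumes the `NAcode` attribute built by `NACode` above):

```r
# Hypothetical helper, assuming the "NAcode" attribute created by NACode():
# return the (row, column) positions that originally held a given code.
WhichNACode <- function(x, code){
    NAval <- attr(x, "NAcode")
    NAval[NAval$value %in% code, c("rowid", "colid")]
}
```

With the example above, `WhichNACode(DfwithNA, -2)` would report row 7 of column 2.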

– Joris Meys
    And you should definitely grab some free time and write a package! =) – aL3xa Mar 17 '11 at 16:24
  • THAT is slick... Any news on this topic since March? – Matt Bannert Nov 11 '11 at 13:20
  • I have a small remark: when you start to code your NAs, it is likely that one kind of NA already is NA (e.g. because of the data-reading process). Hence it would be nice if the `code` list accepted NAs and not only negative and positive integers. – Matt Bannert Nov 11 '11 at 14:27
  • @ran2 No news on the topic, but I might wrap that in a package once I figured out data tables. It could be a nice extension to that one. – Joris Meys Nov 11 '11 at 23:46
  • The only annoying thing about attributes-based approaches is that many commands strip attributes, so you have to put in some work to keep them around. – Ari B. Friedman Jun 30 '13 at 10:22
  • This is great @JorisMeys I can't find a package yet unfortunately. I noticed it works for characters too but not for factors. Is there a work around for factors? – Sylv Jun 23 '23 at 02:57

The most obvious way seems to use two vectors:

  • Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
  • Vector 2: a vector of factors indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.

Having this structure gives you a great deal of flexibility: all the standard na.rm arguments still work on the data vector, while you can express more complex concepts with the factor vector.
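A minimal sketch of that layout (the names and codes are illustrative):

```r
# Data vector: plain NAs, so standard tools behave as usual
answer <- c(2, 50, NA, NA)
# Parallel factor recording why each entry is (or isn't) missing
reason <- factor(c("ok", "ok", "not asked", "refused"))

mean(answer, na.rm = TRUE)    # ordinary NA handling still works: 26
reason[is.na(answer)]         # the kinds of missingness present
```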

Update following questions from @gsk3

  1. "Data storage will dramatically increase": the data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
  2. "Programs don't automatically deal with it": that's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies you will have to do something bespoke. If you just want to analyse the data where the NAs are "Question not asked", then use a data frame subset.
  3. "Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I envisaged a data frame of the two vectors, and I would subset that data frame based on the second vector.
  4. "There's no standard implementation, so my solution might differ from someone else's": true. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
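For point 2, a small sketch (the column names are illustrative): with both vectors in one data frame, restricting the analysis to one kind of missingness is a one-line subset.

```r
df <- data.frame(answer = c(2, 50, NA, NA),
                 reason = c("ok", "ok", "not asked", "refused"))

# Drop the "Question not asked" rows before analysing
subset(df, reason != "not asked")
```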

I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.

– csgillespie
  • This would definitely work, but I see three problems with it: 1) Data storage requirements are dramatically increased (in a survey with hundreds of variables, this is non-trivial), 2) Programs don't automatically deal with it (now you have to manipulate two vectors together every time you want to conceptually manipulate a variable), and 3) There's no standard implementation, so my solution might differ from someone else's. Might be worth writing a new class that holds a vector plus an index, but then every interesting function has to have a method for it. – Ari B. Friedman Mar 17 '11 at 14:58
  • Thanks for the update. I'm envisioning this as a large data.frame already, and data.frames can't store other data.frames. So without using a list to hold the data.frames of variables, it'd be hard to implement. I'll try to give a clearer picture of what I'm looking for later today. – Ari B. Friedman Mar 17 '11 at 16:12
  • @cgillespie: You needn't increase storage significantly if you use sparse vectors or matrices. – Iterator Nov 11 '11 at 23:24

This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi: you can assign extremely negative values to several types of NA (putting the NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:

  • MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
  • MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
  • MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) does not simplify: the missingness depends on the unobserved values themselves.

IMHO this question is more suitable for CrossValidated.

But here's a link from SO that you may find useful:

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

– aL3xa
  • I agree that a proper accounting for missingness involves appropriately dealing with the values in your analysis, a la Rubin. I have dealt with this in different ways in different studies (e.g. multiple imputation or more ad hoc solutions), but they all start from identifying that there are different kinds of missingness. My question is about the technical aspect of keeping track of those different kinds of missingness within a dataset. – Ari B. Friedman Mar 17 '11 at 15:01
  • In that case you shouldn't use `NA` at all - like @Ralph suggested. – aL3xa Mar 17 '11 at 16:00
  • @aL3xa: But as I noted in Ralph's suggestion, NA has some nice properties in that it doesn't allow you to ignore the missing data by accident. – Ari B. Friedman Mar 17 '11 at 16:10
  • Well, adding `attr` to your data as @Joris suggested is the ultimate solution (though it's still a workaround)... boy, are my answers useless or what?! =) – aL3xa Mar 17 '11 at 16:19
  • I'm always looking at my data while I'm coding it, so I'm never really ignoring it by accident. After you get the kinks out I think it's a good idea to go to "real NA" since you may (and probably will) be processing data you have never seen before which has "new" missing values which you need to catch. – Ralph Winters Mar 17 '11 at 17:45
  • @Ralph: I try to always look at my data as well. But often when beginning an analysis I'd like a generic NA, and only later do I realize how I'd like to treat the different kinds of NA. I do everything in code, so I can always go back to the cleaning phase and deal with it there, but Stata's multiple kinds of missingness has given me a nice workflow in the past, as well as made understanding what's going on with the data easier. – Ari B. Friedman Mar 19 '11 at 18:41
  • Multiple missingness is a good feature, but can cause problems. Needs to be designed really well and be consistently coded from the source system. – Ralph Winters Mar 20 '11 at 15:49

You can dispense with NA entirely and just use the coded values. You can then also roll them up into a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
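A quick sketch of that roll-up (codes and names are illustrative): keep the codes as ordinary values while cleaning, then collapse them into a single generic NA for analysis.

```r
# Survey codes kept as ordinary values during cleaning
x <- c(3, 7, -1, -5, 9)
missing.codes <- c(-1, -5, -7, -9)

# Roll all the codes up into a global missing value for analysis
x.analysis <- replace(x, x %in% missing.codes, NA)
x.analysis
# [1]  3  7 NA NA  9
```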

-Ralph Winters

  • This works. However, generally I prefer re-coding to NA so that analyses fail when I fail to account for the missing data properly. I find it's too easy to make mistakes when I've got missing data stored as normal data. – Ari B. Friedman Mar 17 '11 at 15:00

I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data in itself; but on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.:

> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1

That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.

Allan.

– Allan Engelhardt

I'd like to add to the "statistical background" component here. "Statistical Analysis with Missing Data" is a very good read on this.

– Matt Bannert