18

I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt lead me to lose rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1

After looking at ?`!=`, I found a one liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA

I'm wondering if there's a better way to do this? Also, I see no good way of extending this to excluding multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Ok, I just found a better way, `x[is.na(a)|a!=""]`, but that's also pretty cumbersome, and I suspect that I'm approaching this wrong. – Frank Apr 25 '13 at 18:15
  • My substantive question has been resolved by the first two answers. I still wonder if there is a reason for different behavior here than with standard data.frames. It seems that Matthew has designed it to only show rows that evaluate to True, while data.frames show rows that evaluate to True or (logical) NA...? – Frank Apr 25 '13 at 18:42
  • 5
    btw, noticed you wrote `x <- data.table(a);x`. nice little thing, you can just write `(x <- data.table(a))` and r will print out x when the command os over. – haki Apr 25 '13 at 18:48
  • @Frank you prefer the `data.frame` way of printing `NA` rows?? I've always thought that output was terrible! – eddi Apr 25 '13 at 18:50
  • @haki: Oh, right; forgot about that. Thanks. – Frank Apr 25 '13 at 18:57
  • @eddi: No, I haven't used data.frames in ages. :) I'm just curious if there's a deeper reason the difference in behavior. I only stumbled across the way data.frames work when doing my due diligence on this question. – Frank Apr 25 '13 at 18:59
  • @Frank well, I don't think there is any deeper reason besides `data.frame` doing smth silly and counterproductive :) – eddi Apr 25 '13 at 19:10
  • @Frank, I think I know why this is happening. I'll make an edit to my answer, in case you're interested. – Arun Apr 25 '13 at 20:08
  • @Arun, Yes, I'd be interested. (Let me know if spamming the comments is bad etiquette; I see on meta that there is not/never will be a PM option, so...) – Frank Apr 25 '13 at 20:34
  • @Frank, I've edited. Hope it's not too difficult to follow and I've explained as good as I think I've :). In any case, I think there may be some inconsistencies (which you've pointed out). I'll try checking some things when I find time and then post if I find something important to be notified. Thanks again for the nice post! – Arun Apr 25 '13 at 21:01

3 Answers3

18

To provide a solution to your question:

You should use %in%. It gives you back a logical vector.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA

To find out why this is happening in data.table:

(as opposted to data.frame)

If you look at the data.table source code on the file data.table.R under the function "[.data.table", there's a set of if-statements that check for i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here. And I've also marked them into 3 parts.

First, why dt[a != ""] doesn't work as expected (by the OP)?

First, part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA) . Then part 3 is executed. So, NA is replaced to FALSE (the last line of the logical loop).

why does x[!(a== "")] work as expected (by the OP)?

part 1 returns a call once again. But, part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happened. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to logical again. So, (a=="") evaluates to c(FALSE, TRUE, NA).

Now, this is checked for is.logical in part 3. So, here, NA gets replaced to FALSE. It therefore becomes, c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2) seq_len(nrow(x))[-2] = c(1,3) is returned. so, x[!(a=="")] basically returns x[c(1,3)] which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}

Given that, I think there are some inconsistencies with the syntax.. And if I manage to get time to formulate the problem, then I'll write a post soon.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Wow, I never tried that. I just figured `==` and `%in%` worked the same way. I guess this has something to do with a redefinition of `==` in the `i` argument of a data.table? – Frank Apr 25 '13 at 18:35
  • Thanks! Very interesting. The one step I still don't follow is when "`NA` gets replaced by `FALSE`", but I may just need to read your answer and the code more carefully. Am I right to read this as follows: "when `i=!(arg)` where `arg` evaluates to logical, it finds rows satisfying `arg` and returns only the complementary rows"? Also, is there any other type of `i` argument (besides a negated logical) for which I should know/consider whether or not it transforms into a `call` after `substitute` is applied? I'm not all that familiar with `substitute`. Thanks again! – Frank Apr 25 '13 at 21:20
  • 1
    In case you've already read my last comment, I now understand why/when NA gets replaced by False. Got it. – Frank Apr 25 '13 at 21:26
  • good stuff, except I would say that `d[a != ""]` works as expected and `d[!(a == "")]` doesn't – eddi Apr 25 '13 at 21:44
  • @Arun I basically think that `subset(df, a != "")` (and `subset(df, !(a == ""))`), where `df` is a `data.frame` has it right, and `df[df$a != "",]` and `df[!(df$a != "")]` don't - what the hell is a row with row name `NA` and where does it come from (in principle, not in how the details work). – eddi Apr 25 '13 at 22:47
  • @eddi, `NA` is logical. So, the result makes sense. It returns `NA_character_`. I mean `[.data.frame` does what it's supposed to do. In any case, I'll write back after giving this some more thought. Good night! – Arun Apr 25 '13 at 22:54
  • @Arun how does this make sense: `df = data.frame(a = c("1", NA), b = c(1:2)); df[df$a == "1",]; df[df$a != "1",]`?? DO note the row name and the `b` value in both expressions. – eddi Apr 25 '13 at 22:57
  • @eddi, because `?NA` states this clearly: `Computations using NA will normally result in NA: a possible exception is where NaN is also involved, in which case either might result.` `NA` is logical. And any operation on it (as said just above) will result in `NA`. – Arun Apr 26 '13 at 06:47
  • The only thing I am wondering is why does subset evaluate the expression and replace NA by FALSE? Where as, `[.data.frame` doesn't. – Arun Apr 26 '13 at 06:57
  • @Arun all you're saying is that `data.frame` doesn't bother to return sensible results, when user supplies nonsensical expression. `subset` on the other hand does. – eddi Apr 26 '13 at 12:32
  • @eddi, I'm not going to get into another argument with you. Suffice to say, the behaviour of NA (and computations using NA) is quite clear to me. I *understand* why `[.data.frame` gives what it gives. I don't care if it makes sense to you or not (once again, it's the documentation that counts). What's unclear to me is why is there a difference in the results between `subset` and `[.data.frame`. – Arun Apr 26 '13 at 12:34
  • @Arun documented bad behavior is still bad behavior, not that this one is documented, as there was no obvious computation involving e.g. the `b` column in the expressions I wrote – eddi Apr 26 '13 at 12:40
  • @eddi, join here: http://chat.stackoverflow.com/rooms/28979/discussion-between-eddi-and-arun – Arun Apr 26 '13 at 12:41
  • 1
    @eddi, the only problem is that it *isn't* bad behaviour if you understand the role of NA logical. So, why don't you go ahead and write this "bad behaviour" on R-devel? – Arun Apr 26 '13 at 13:04
  • @eddi: `?[` (use back ticks before and after `[`). Check the part about `NAs in indexing`: *When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA in the corresponding element of a logical, integer, numeric, complex or character result, and NULL for a list. (It returns 00 for a raw result.)* – Arun Apr 26 '13 at 13:30
  • @Arun re r-devel: I might when I get more time; maybe once the datatable issue we discussed before is closed (see here: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-quot-by-quot-td4664770.html ) – eddi Apr 26 '13 at 14:38
  • @eddi, thanks for the link. Do read `?[` as well. Also, check [**this post**](http://stackoverflow.com/questions/16239153/indexing-dtx-and-dtx-gives-different-results-in-data-table) of mine where I have asked about the inconsistent behaviour of data.table (and do weigh in). – Arun Apr 26 '13 at 14:43
5

Background answer from Matthew :

The behaviour with != on NA as highlighted by this question wasn't intended, thinking about it. The original intention was indeed to be different than [.data.frame w.r.t. == and NA and I believe everyone is happy with that. For example, FAQ 2.17 has :

DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]

That convenience is achieved by dint of :

DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA

The motivation is not just convenience but speed, since each and every !, is.na, & and == are themselves vector scans with associated memory allocation of each of their results (explained in intro vignette). So although x[is.na(a) | a!=""] is a working solution, it's exactly the type of logic I was trying to avoid needing in data.table. x[!a %in% ""] is slightly better; i.e, 2 scans (%in% and !) rather than 3 (is.na, | and !=). But really x[a != ""] should do what Frank expected (include NA) in a single scan.

New feature request filed which links back to this question :

DT[col!=""] should include NA

Thanks to Frank, Eddi and Arun. If I haven't understood correctly feel free to correct, otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where both colA and colB are NA (which it doesn't currently, I believe).

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Matthew, `DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]` is not really necessary/the only way. One could write `DF[which(DF$ColA == DF$ColB), ]`. `NA`'s interpretation of FALSE goes against much of R's "statistical aspects" to data analysis (unknown/missing data), wouldn't you say? Although, probably this is more appropriate on the mailing list. – Arun Jun 09 '13 at 19:54
  • I just tested `system.time(DF[which(DF$V1 == DF$V2), ])` and `system.time(DT[V1 == V2])` on `set.seed(45); DF <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2)); DT <- data.table(DF)` and `which` seems to be faster... Any ideas? – Arun Jun 09 '13 at 20:09
  • Even `DT[which(V1 == V2)]` is faster than `DT[V1 == V2]`. – Arun Jun 09 '13 at 20:13
  • @Arun Interesting thanks, [FR#4652](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4652&group_id=240&atid=978) is now filed to address that speed issue. Have replied on mailing list about the first point here : http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-tp4669097.html – Matt Dowle Jun 09 '13 at 22:19
3

As you have already figured out, this is the reason:

a != ""
#[1]  TRUE    NA FALSE

You can do what you figured out already, i.e. x[is.na(a) | a != ""] or you could setkey on a and do the following:

setkey(x, a)
x[!J("")]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Ah, the `setkey` approach is appealing, too, but I prefer to use `%in%`, as Arun suggested. Thanks. Along the same lines, I just realized that `x[!(a=="")]` also works. – Frank Apr 25 '13 at 18:36