
I have a large R data.table with a multi-column key, where some of the value columns contain NAs. I'd like to remove groups that are entirely NA in one or more of the value columns, but keep the whole group otherwise, repeating this for each column of the key.

To give a simplified example:

library(data.table)
DT = data.table(
    Series = rep(letters[1:12], each = 3), 
    Id = 1:12,
    Value1 = c(1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3)), 
    Value2 = c(rep(NA,3), 1:4, NA, 6:9, rep(NA,3), 1:9, 1:9, rep(NA,3)))
DT
    Series Id Value1 Value2
 1:      a  1      1     NA
 2:      a  2      2     NA
 3:      a  3      3     NA
 4:      b  4     NA      1
 5:      b  5      5      2
 6:      b  6      6      3
 7:      c  7      7      4
 8:      c  8      8     NA
 9:      c  9      9      6
10:      d 10     NA      7
11:      d 11     NA      8
12:      d 12     NA      9
13:      e  1      1     NA
14:      e  2      2     NA
15:      e  3      3     NA
16:      f  4     NA      1
17:      f  5      5      2
18:      f  6      6      3
19:      g  7      7      4
20:      g  8      8      5
21:      g  9      9      6
22:      h 10     NA      7
23:      h 11     NA      8
24:      h 12     NA      9
25:      i  1      1      1
26:      i  2      2      2
27:      i  3      3      3
28:      j  4     NA      4
29:      j  5      5      5
30:      j  6      6      6
31:      k  7      7      7
32:      k  8      8      8
33:      k  9      9      9
34:      l 10     NA     NA
35:      l 11     NA     NA
36:      l 12     NA     NA
    Series Id Value1 Value2

So I would like to drop:

  • Series: a, d, e, h and l
  • Ids: 4, 10, 11 and 12

The correct result should look like this:

    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  8      8     NA
 5:      c  9      9      6
 6:      f  5      5      2
 7:      f  6      6      3
 8:      g  7      7      4
 9:      g  8      8      5
10:      g  9      9      6
11:      i  1      1      1
12:      i  2      2      2
13:      i  3      3      3
14:      j  5      5      5
15:      j  6      6      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9
    Series Id Value1 Value2

What I managed so far:

I can find the Series that are entirely NA for Value1 like this:

DT[, sum(1-is.na(Value1)) == 0, by = Series][V1 == TRUE]

And I could even do

setkey(DT, Series)
DT = DT[DT[, sum(1-is.na(Value1)) == 0, by = Series][V1 != TRUE]]

But now I end up with V1 appearing in the final table.
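
One way I can see around the leftover V1 (a sketch only, using data.table's not-join on the keyed Series; it still handles just one value column and one key column, and badSeries is only an illustrative name):

setkey(DT, Series)
badSeries = DT[, all(is.na(Value1)), by = Series][V1 == TRUE, Series]   # Series entirely NA in Value1
DT[!badSeries]   # not-join keeps every other Series; no helper column is carried along

But the real question remains how to do this over several value columns and several key columns.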

Matt Dowle
Corvus
    why not just `na.omit(DT)`? – Justin Jan 16 '13 at 16:11
  • `na.omit` is just removing all the NAs. I only want to remove when the entire "block" is NA. So in the example, there are no values for m or z, so they should be dropped, but I don't want to drop the two NAs in series b – Corvus Jan 16 '13 at 16:14
  • Then @Arun has your answer, but you shouldn't need `as.logical`. – Justin Jan 16 '13 at 16:30
  • Interesting, `DT[, .SD[all(!is.na(Value))], by=Series]` works fine. – Justin Jan 16 '13 at 16:34
  • Just did a large edit to try to make the question clearer and use a better/simpler example. From Arun's answer it looks like .SD is the way to go, and presumably a logical or can sort out multiple Value columns, but how do I do multiple key columns? Is the only way to do them one at a time? (Since I can't have two keys at once) – Corvus Jan 16 '13 at 16:45
  • @Corone Great edit, +1. I edited the first paragraph a little further. Is that what you mean? – Matt Dowle Jan 16 '13 at 19:07
  • Yes thanks @Matthew that's exactly what I meant – Corvus Jan 17 '13 at 08:29
  • @Corone Great. I edited Arun's answer to make it a full solution. – Matt Dowle Jan 17 '13 at 09:49

2 Answers


You can do this to get those entries where not ALL of Value1 is NA:

setkey(DT, "Series")
DT[, .SD[(!all(is.na(Value1)))], by=Series]

The parentheses around !all are needed to avoid data.table's not-join syntax, which Matthew will look into (see comments). It is the same as this:

DT[, .SD[as.logical(!all(is.na(Value1)))], by=Series]

Building on that to answer the new, clarified question:

allNA = function(x) all(is.na(x))     # helper: TRUE if a column is entirely NA
# keep a group only if none of its columns is entirely NA; repeat for each key column
for (i in c("Id","Series"))
    DT = DT[, if (!any(sapply(.SD,allNA))) .SD else NULL, by=i]
DT
    Series Id Value1 Value2
 1:      i  1      1      1
 2:      i  2      2      2
 3:      i  3      3      3
 4:      b  5      5      2
 5:      b  6      6      3
 6:      f  5      5      2
 7:      f  6      6      3
 8:      j  5      5      5
 9:      j  6      6      6
10:      c  7      7      4
11:      c  8      8     NA
12:      c  9      9      6
13:      g  7      7      4
14:      g  8      8      5
15:      g  9      9      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9

That changes the order, though, so it isn't precisely the result requested. The following keeps the original order and should be faster too.

# starting fresh from the original DT in the question again (allNA as defined above)
DT[,drop:=FALSE]                      # helper flag: mark rows whose group should be dropped
for (i in c("Series","Id"))
    DT[,drop:=drop|any(sapply(.SD,allNA)),by=i]   # flag the group if any of its columns is entirely NA
DT[(!drop)][,drop:=NULL][]            # keep unflagged rows, then remove the helper column
    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  8      8     NA
 5:      c  9      9      6
 6:      f  5      5      2
 7:      f  6      6      3
 8:      g  7      7      4
 9:      g  8      8      5
10:      g  9      9      6
11:      i  1      1      1
12:      i  2      2      2
13:      i  3      3      3
14:      j  5      5      5
15:      j  6      6      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9
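
If this needs doing repeatedly, the same idea can be wrapped up in a small helper. The following is only a sketch (the name dropAllNAGroups is made up here, and it collects the offending row numbers with .I and .SDcols rather than a flag column); run against the original DT from the question, it keeps the original row order:

dropAllNAGroups = function(dt, keys, valcols) {
    bad = rep(FALSE, nrow(dt))
    for (k in keys) {
        # row numbers (in dt) of the groups where some value column is entirely NA
        idx = dt[, .I[any(sapply(.SD, function(x) all(is.na(x))))],
                 by = k, .SDcols = valcols]$V1
        bad[idx] = TRUE
    }
    dt[!bad]    # logical subset, so the original row order is preserved
}
dropAllNAGroups(DT, keys = c("Series","Id"), valcols = c("Value1","Value2"))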
Matt Dowle
Arun
  • +1 `!` = not join, but I'm not sure why it wouldn't work (shouldn't make a difference for logical, may be a prob of not join in combination with recycling, perhaps) – Matt Dowle Jan 16 '13 at 16:35
  • 1
    Okay, now I understand. `.SD[(!all(is.na(Value)))]` (with the paranthesis) works! – Arun Jan 16 '13 at 16:36
  • 1
    parens does the same to stop the not-join interpretation of `!`. When you say didn't work, what you mean, and which version of data.table please? Thanks. – Matt Dowle Jan 16 '13 at 16:39
  • Thanks, that works much better than what I had, but any idea how to extend it to multiple columns? – Corvus Jan 16 '13 at 16:42
  • `setkey(DT,Series)[unique(DT[,na.omit(Value),by=Series][,Series]),]` does work too. But I can't be bothered to create a big data.table for benchmarking. – Roland Jan 16 '13 at 16:50
  • @Arun Thanks a lot. That error message looks strange. A side issue I realise, but I'll take a look. – Matt Dowle Jan 16 '13 at 18:51

What about using the `complete.cases` function?

DT[complete.cases(DT),]

It will drop every row that has an NA in any column:

> DT[complete.cases(DT),]
    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  9      9      6
 5:      f  5      5      2
 6:      f  6      6      3
 7:      g  7      7      4
 8:      g  8      8      5
 9:      g  9      9      6
10:      i  1      1      1
11:      i  2      2      2
12:      i  3      3      3
13:      j  5      5      5
14:      j  6      6      6
15:      k  7      7      7
16:      k  8      8      8
17:      k  9      9      9
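
A variation on the same idea (again only a sketch, not from this answer): the completeness check can be restricted to the value columns by passing them to `complete.cases` directly, which matters if other columns might legitimately contain NA. It still drops individual rows rather than whole groups, so it doesn't answer the question either:

DT[complete.cases(Value1, Value2)]   # row-wise check on the two value columns only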
Gago-Silva
  • Nice function, useful to know, but I want a "notempty.cases" function - Only drop those cases that are completely empty for one column, and leave the other NAs in. – Corvus Jan 16 '13 at 16:47
  • 4
    ok, then you should maybe change the dataset example, because all variables that have NA values, have NA values for the 3 rows of the same letter (in both Value1 and value2) – Gago-Silva Jan 16 '13 at 16:52