Removing groups with all NA in Data.Table or DPLYR in R

Question

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))



dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))

I have a tall dataframe and in that data frame I want to remove student IDS that contain NA for all 'score' or for all 'time'. This is just if it is all NA, if there are some NA then I want to keep all their records...

Hi bvowe. Are you looking for a `data.table` solution or why is that tag used? — dario, Feb 21 '20 at 12:11
@dario i was interested in data.table and dplyr because i am learning data.table and it is helpful to compare the solutions using both packages. thanks a bunch! — bvowe, Feb 21 '20 at 12:18
Does this answer your question? [filter rows in data.table with \`by\`](https://stackoverflow.com/questions/34393053/filter-rows-in-data-table-with-by) — Cole, Feb 21 '20 at 12:54

score 3 · Accepted Answer · answered Feb 21 '20 at 12:03

3

Is this what you want?

library(dplyr)

dataHAVE %>%
    group_by(student) %>%
    filter(!all(is.na(score)))

  student  time score
    <dbl> <dbl> <dbl>
1       1     1     7
2       1     2     9
3       1     3     5
4       3     1    NA
5       3     2     3
6       3     3     9
7       5    NA     7
8       5     2    NA
9       5     3     5

Each student is only kept if not (!) all score values are NA

answered Feb 21 '20 at 12:03

Georgery

7,643
1
19
52

This misses the `time` constraint – Richard Telford Feb 21 '20 at 12:08

dario · Answer 2 · 2020-02-23T07:31:31.667

3

Since nobody suggested one, here is a solution using data.table:

  library(data.table)
  dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                        "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                        "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

Edit:

Previous but wrong code:

dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]

New and correct code:

dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]

Returns:

   student time score
1:       1    1     7
2:       1    2     9
3:       1    3     5
4:       3    1    NA
5:       3    2     3
6:       3    3     9
7:       5   NA     7
8:       5    2    NA
9:       5    3     5

Edit:

Updatet data.table solution with @Cole s suggestion...

edited Feb 23 '20 at 07:31

answered Feb 21 '20 at 12:06

dario

6,415
2
12
26

2

This is somewhat a duplicate question, but here's a decent looking way: ```dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]``` See also: https://github.com/Rdatatable/data.table/issues/788 – Cole Feb 21 '20 at 12:51
1

@Cole: Thanks a bunch for taking the time and looking up (and finding!) an excellent solution! – dario Feb 21 '20 at 12:59
@this does not work. student 2 should be removed because all of their scores are NA – bvowe Feb 22 '20 at 22:58
@bvowe: You are right of course. If fixed the missplaced `&` and replaced it with a correct `|`... sorry for that – dario Feb 23 '20 at 07:32

ThomasIsCoding · Answer 3 · 2020-02-21T12:37:48.680

1

Here is a base R solution using subset + ave

dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))

or

dataWANT <- subset(dataHAVE,
                   !Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))

edited Feb 21 '20 at 12:37

answered Feb 21 '20 at 12:02

ThomasIsCoding

96,636
9
24
81

score 1 · Answer 4 · answered Feb 21 '20 at 22:30

1

Another option:

library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]

answered Feb 21 '20 at 22:30

chinsoon12

25,005
4
25
35

score 0 · Answer 5 · answered Feb 21 '20 at 12:06

Create a dummy variable, and filter based on that

library("dplyr")

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                      "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                      "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

dataHAVE %>% 
  mutate(check=is.na(time)&is.na(score)) %>% 
  filter(check == FALSE) %>% 
  select(-check)
#>    student time score
#> 1        1    1     7
#> 2        1    2     9
#> 3        1    3     5
#> 4        2    1    NA
#> 5        2    2    NA
#> 6        2    3    NA
#> 7        3    1    NA
#> 8        3    2     3
#> 9        3    3     9
#> 10       5   NA     7
#> 11       5    2    NA
#> 12       5    3     5

^{Created on 2020-02-21 by the reprex package (v0.3.0)}

score 0 · Answer 6 · answered Feb 21 '20 at 13:48

data.table solution generalising to any number of columns:

dataHAVE[, 
         .SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)], 
         by = student]

#    student time score
# 1:       1    1     7
# 2:       1    2     9
# 3:       1    3     5
# 4:       3    1    NA
# 5:       3    2     3
# 6:       3    3     9
# 7:       5   NA     7
# 8:       5    2    NA
# 9:       5    3     5

Removing groups with all NA in Data.Table or DPLYR in R

6 Answers6

Edit:

Edit: