2
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))



dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))

I have a tall dataframe and in that data frame I want to remove student IDS that contain NA for all 'score' or for all 'time'. This is just if it is all NA, if there are some NA then I want to keep all their records...

bvowe
  • 3,004
  • 3
  • 16
  • 33
  • Hi bvowe. Are you looking for a `data.table` solution or why is that tag used? – dario Feb 21 '20 at 12:11
  • @dario i was interested in data.table and dplyr because i am learning data.table and it is helpful to compare the solutions using both packages. thanks a bunch! – bvowe Feb 21 '20 at 12:18
  • Does this answer your question? [filter rows in data.table with \`by\`](https://stackoverflow.com/questions/34393053/filter-rows-in-data-table-with-by) – Cole Feb 21 '20 at 12:54

6 Answers6

3

Is this what you want?

library(dplyr)

dataHAVE %>%
    group_by(student) %>%
    filter(!all(is.na(score)))

  student  time score
    <dbl> <dbl> <dbl>
1       1     1     7
2       1     2     9
3       1     3     5
4       3     1    NA
5       3     2     3
6       3     3     9
7       5    NA     7
8       5     2    NA
9       5     3     5

Each student is only kept if not (!) all score values are NA

Georgery
  • 7,643
  • 1
  • 19
  • 52
3

Since nobody suggested one, here is a solution using data.table:

  library(data.table)
  dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                        "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                        "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

Edit:

Previous but wrong code:

dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]

New and correct code:

dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]

Returns:

   student time score
1:       1    1     7
2:       1    2     9
3:       1    3     5
4:       3    1    NA
5:       3    2     3
6:       3    3     9
7:       5   NA     7
8:       5    2    NA
9:       5    3     5

Edit:

Updatet data.table solution with @Cole s suggestion...

dario
  • 6,415
  • 2
  • 12
  • 26
  • 2
    This is somewhat a duplicate question, but here's a decent looking way: ```dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]``` See also: https://github.com/Rdatatable/data.table/issues/788 – Cole Feb 21 '20 at 12:51
  • 1
    @Cole: Thanks a bunch for taking the time and looking up (and finding!) an excellent solution! – dario Feb 21 '20 at 12:59
  • @this does not work. student 2 should be removed because all of their scores are NA – bvowe Feb 22 '20 at 22:58
  • @bvowe: You are right of course. If fixed the missplaced `&` and replaced it with a correct `|`... sorry for that – dario Feb 23 '20 at 07:32
1

Here is a base R solution using subset + ave

dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))

or

dataWANT <- subset(dataHAVE,
                   !Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

Another option:

library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

Create a dummy variable, and filter based on that

library("dplyr")

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                      "time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
                      "score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))

dataHAVE %>% 
  mutate(check=is.na(time)&is.na(score)) %>% 
  filter(check == FALSE) %>% 
  select(-check)
#>    student time score
#> 1        1    1     7
#> 2        1    2     9
#> 3        1    3     5
#> 4        2    1    NA
#> 5        2    2    NA
#> 6        2    3    NA
#> 7        3    1    NA
#> 8        3    2     3
#> 9        3    3     9
#> 10       5   NA     7
#> 11       5    2    NA
#> 12       5    3     5

Created on 2020-02-21 by the reprex package (v0.3.0)

giocomai
  • 3,043
  • 21
  • 24
0

data.table solution generalising to any number of columns:

dataHAVE[, 
         .SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)], 
         by = student]

#    student time score
# 1:       1    1     7
# 2:       1    2     9
# 3:       1    3     5
# 4:       3    1    NA
# 5:       3    2     3
# 6:       3    3     9
# 7:       5   NA     7
# 8:       5    2    NA
# 9:       5    3     5
s_baldur
  • 29,441
  • 4
  • 36
  • 69