0

I have a dataframe with 1000 IDs, each with > 100 rows of data. I want to remove all IDs that meet a criteria based on another column at least once.

As an example with the dummy data below, I want to remove all IDs, where var2 is <20 at least once.

How do I do this without spelling out each individual ID to be dropped?

dummy data of similar structure:

data <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
                  var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
                  var2 = sample(1:100, 50))

I have tried using the function droplevel, but I do not want to spell out every individual ID to be dropped.

user303287
  • 131
  • 5

2 Answers2

2

tidyverse

df <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
                   var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
                   var2 = sample(1:100, 50))

library(tidyverse)
df %>% 
  group_by(ID) %>% 
  filter(!any(var2 < 20)) %>% 
  ungroup()

#> # A tibble: 25 x 3
#>    ID    var1   var2
#>    <chr> <chr> <int>
#>  1 B2    a       100
#>  2 B2    c        67
#>  3 B2    c        64
#>  4 B2    b        78
#>  5 B2    a        73
#>  6 B3    a        83
#>  7 B3    b        32
#>  8 B3    b        23
#>  9 B3    c        65
#> 10 B3    d        96
#> # ... with 15 more rows

Created on 2022-01-14 by the reprex package (v2.0.1)

data.table

library(data.table)
setDT(df)[, .SD[!any(var2 < 20)], by = ID]
#>      ID var1 var2
#>  1:  B1    a   47
#>  2:  B1    b   81
#>  3:  B1    b   95
#>  4:  B1    c   48
#>  5:  B1    d   43
#>  6:  B4    a   77
#>  7:  B4    c   54
#>  8:  B4    c   23
#>  9:  B4    b   55
#> 10:  B4    a   25
#> 11:  B6    a   98
#> 12:  B6    c   99
#> 13:  B6    c   86
#> 14:  B6    b   92
#> 15:  B6    a   33
#> 16:  B7    a   73
#> 17:  B7    b   94
#> 18:  B7    b   62
#> 19:  B7    c   40
#> 20:  B7    d   49
#> 21: B10    a   66
#> 22: B10    c   44
#> 23: B10    c   35
#> 24: B10    b   76
#> 25: B10    a   38
#>      ID var1 var2

Created on 2022-01-14 by the reprex package (v2.0.1)

Yuriy Saraykin
  • 8,390
  • 1
  • 7
  • 14
0

I just found the answer here: How to remove all rows belonging to a particular group when only one row fulfills the condition in R?

This does the trick:

new.data <- subset(data, ave(var2 >=20, ID, FUN = all))
user303287
  • 131
  • 5
  • why does `new.data` have 0 rows? – Mwavu Jan 14 '22 at 14:38
  • sorry - this should be >=20, not < 20. I have amended this now. <20 retains all IDs that have var2 values above 20, and there aren't any, hence ```data.new``` had 0 rows – user303287 Jan 14 '22 at 14:53