0

Say I'm grouping a data.table by a certain column and counting the values per group. Then I want to remove the groups that have count N < 2. Is there an efficient, data.table way to do this?

Example data:

id | col1
-------------
1  | "A"
2  | "A"
3  | "B"
4  | "C"
5  | "C"

now: group by col1 and count, remove rows that belong to group with count < 2

Example output: (row 3 was removed)

id | col1
-------------
1  | "A"
2  | "A"
4  | "C"
5  | "C"

I found Subset by group with data.table which is sort of similar, but I don't want to find a specific row per group, but rather identify entire groups if their aggregate satisfies a certain condition.

Thanks

marialagorda
  • 45
  • 2
  • 7

1 Answers1

0
values_2_keep <- DT[, .N, by = col1][N>=2, col1] 
# Alternatively DT[rowid(col1)==2L, col1]

DT[col1 %in% values_2_keep]

   id col1
1:  1    A
2:  2    A
3:  4    C
4:  5    C

Reproducible data:

DT <- data.table(
  id = 1:5,
  col1 = c("A", "A", "B", "C", "C")
)
s_baldur
  • 29,441
  • 4
  • 36
  • 69