5
 library(data.table)
 DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
 DT2 <- data.table(group = c("A", "B", "C"))

I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be

    group popular
 1:     A    TRUE
 2:     B    TRUE
 3:     C   FALSE

What would be an efficient way to get to this?

Updated example: DT2 may actually contain more groups than DT1, so here's an updated example:

 DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
 DT2 <- data.table(group = c("A", "B", "C", "D"))

And the desired output would be

    group popular
 1:     A    TRUE
 2:     B    TRUE
 3:     C   FALSE
 4:     D   FALSE
johnl
  • 310
  • 1
  • 3
  • 11
  • 1
    Does `DT2` always have the same unique `group` from `DT1`? If so, it's simply `DT1[, .(popular = .N >= 2L), by=group]` – Arun Oct 19 '14 at 17:57
  • Hi @Arun, no it does not; I over-simplified the example here, sorry. ``DT2`` may contain more groups than ``DT1``, in which case popular should be ``FALSE`` (as it's not contained in ``DT1$group`` at least twice). – johnl Oct 19 '14 at 18:04
  • +1 for a well-formulated question – Rich Scriven Oct 19 '14 at 18:21
  • here is the non data table way! `table(factor(DT1$group, levels = unique(DT2$group))) >= 2` – rawr Oct 19 '14 at 20:18

2 Answers2

10

I'd just do it this way:

## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
#    group popular
# 1:     A    TRUE
# 2:     B    TRUE
# 3:     C   FALSE
# 4:     D   FALSE ## on the updated example

data.table's join syntax is quite powerful, in that, while joining, you can also aggregate / select / update columns in j. Here we perform a join. For each row in DT2$group, on the corresponding matching rows in DT1, we compute the j-expression .N >= 2L; by specifying by = .EACHI (please check 1.9.4 NEWS), we compute the j-expression each time.


In 1.9.4, .() has been introduced as an alias in all i, j and by. So you could also do:

DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]

When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience). So this can be also written as:

DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
Henrik
  • 65,555
  • 14
  • 143
  • 159
Arun
  • 116,683
  • 26
  • 284
  • 387
3

This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.

require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))

#solution:
DT1[,num_counts:=.N,by=group] #the number of entries in this group, just count the other column
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2,mult="last"][,list(group, popular = (num_counts >= 2))]

#> DT2
#   group popular
#1:     A    TRUE
#2:     B    TRUE
#3:     C   FALSE
Alex
  • 19,533
  • 37
  • 126
  • 195
  • This can be simplified further by aggregating `DT1` instead of `updating` it. – Arun Oct 19 '14 at 17:56
  • i think updating is more efficient: aggregating would involve copying while updating does not require any copying – Alex Oct 19 '14 at 18:04