I have this type of data I have only two groups right now and an user can be part of single/multiple group. There are ~100k users
id grp
1001 A
1001 B
1002 A
1002 A
1003 B
1003 B
I want to create tables which have records having Only from grp A / B etc.
Conceptual output is below, based on above data
ONLY A 1002
ONLY B 1003
BOTH 1001
Please share a R solution and preferable data.table or sql way
Based on answers from @Procrastinatus Maximus, here are some test results
N <- 100000
set.seed(1)
DT <- data.table(
id = sample(N/2, N, TRUE),
grp = sample(c('A','B'), N, TRUE)
)[order(id)]
DT
DT[,.N, by=grp]
grp N
1: B 50170
2: A 49830
This approach solved my problem (suggested by @ Procrastinatus Maximus), why this takes so much time as compared to other
> system.time(DT[, .SD[uniqueN(grp)==1], by = id])
user system elapsed
31.064 0.147 31.244
> system.time(DT[, .(grps = toString(unique(grp))), by = id])
user system elapsed
2.022 0.011 1.987
> system.time(unique(DT)[order(grp), .(grps = toString(grp)), by = id])
user system elapsed
0.707 0.003 0.710
> system.time(DT[, list(grp = paste(grp, collapse = " | ")), by = id])
user system elapsed
0.244 0.001 0.245
> system.time(aggregate(grp ~ id, DT, function(x) toString(unique(x))))
user system elapsed
2.673 0.004 2.680
> system.time(sqldf('select id, group_concat(distinct grp) from DT group by id'))
user system elapsed
0.445 0.000 0.445