0

I have this type of data I have only two groups right now and an user can be part of single/multiple group. There are ~100k users

id  grp
1001    A
1001    B
1002    A
1002    A
1003    B
1003    B

I want to create tables which have records having Only from grp A / B etc.

Conceptual output is below, based on above data

ONLY A  1002
ONLY B  1003
BOTH    1001

Please share a R solution and preferable data.table or sql way

Based on answers from @Procrastinatus Maximus, here are some test results

N <- 100000
set.seed(1)
DT <- data.table(
  id = sample(N/2, N, TRUE),
  grp = sample(c('A','B'), N, TRUE)
)[order(id)]
DT

DT[,.N, by=grp]
   grp     N
1:   B 50170
2:   A 49830

This approach solved my problem (suggested by @ Procrastinatus Maximus), why this takes so much time as compared to other

> system.time(DT[, .SD[uniqueN(grp)==1], by = id])
   user  system elapsed 
 31.064   0.147  31.244 

> system.time(DT[, .(grps = toString(unique(grp))), by = id])
   user  system elapsed 
  2.022   0.011   1.987 
> system.time(unique(DT)[order(grp), .(grps = toString(grp)), by = id])
   user  system elapsed 
  0.707   0.003   0.710 
> system.time(DT[, list(grp = paste(grp, collapse = " | ")), by = id])
   user  system elapsed 
  0.244   0.001   0.245 
> system.time(aggregate(grp ~ id, DT, function(x) toString(unique(x))))
   user  system elapsed 
  2.673   0.004   2.680 
> system.time(sqldf('select id, group_concat(distinct grp) from DT group by id'))
   user  system elapsed 
  0.445   0.000   0.445 
usct01
  • 838
  • 7
  • 18
  • Could you clarify if this example covers your entire need, or that there might also be a grp C and that you are interested in A and C, B and C, A and B and all of the groups? – Wannes Rosiers Jul 20 '16 at 12:09
  • I have only two groups right now and an user can be part of single/multiple group. There are ~100k users. Hope this helps – usct01 Jul 20 '16 at 12:11
  • Did you even try to solve it by yourself ? SO is not a free code service... – Tensibai Jul 20 '16 at 12:14
  • Yes, I tried a solution using below method, however I feel there is a better way (data.table way). I created 2 tables for each group and then used anti join from dplyr. I am not sure how to do this using data.table – usct01 Jul 20 '16 at 12:15
  • 1
    using **data.table**: `unique(dt)[, .(toString(grp)), by = id]` – Jaap Jul 20 '16 at 12:31

1 Answers1

4

Supposing your data is already in a data.table (if not, convert it to a data.table with setDT(name_of_your_dataframe)):

library(data.table)
# option 1
unique(DT)[, .(grps = toString(grp)), by = id]
# option 2
DT[, .(grps = toString(unique(grp))), by = id]

which gives:

     id grps
1: 1001 A, B
2: 1002    A
3: 1003    B

As suggested by @Frank: to get the same sequences, it is a good idea to order by the grp column:

unique(DT)[order(grp), .(grps = toString(grp)), by = id]

Several other alternatives:

1) base R:

aggregate(grp ~ id, DT, function(x) toString(unique(x)))

2) dplyr:

library(dplyr)
DT %>% group_by(id) %>% summarise(grps = toString(unique(grp)))

3) sqldf:

library(sqldf)
sqldf('select id, group_concat(distinct grp) from DT group by id')
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thanks a lot for your input, I actually used your first solution, DT[, .SD[uniqueN(grp)==1], by = id]. This is what I was looking for. Only thing is for 35k records and 2 columns it takes this much time, I may be wrong but this seems more user system elapsed 22.041 0.103 22.143 – usct01 Jul 21 '16 at 13:05
  • I use this code to aggregate strings ( in the problem I posted I don't need to collapse the grps, I want to create 2 separate tables) dt[, list(grp = paste(grp, collapse = " | ")), by = id] – usct01 Jul 21 '16 at 13:11
  • @usct01 Could you explain why you used `DT[, .SD[uniqueN(grp)==1], by = id]`? I didn't propose that in my answer. – Jaap Jul 24 '16 at 12:32
  • In your first response, there was a link which was showing that solution (I replaced > with ==). I will search for that link and post it if I get it – usct01 Jul 25 '16 at 05:14