Find unique 'item groups' in multivariate data

Question

I am trying to isolate the unique groups of items in my data: unique groupings of rows associated with a key column, not unique items, which is what most people use the unique function for. The question takes some careful reading, so please be kind enough to digest the example first.

To be clear, I do NOT want the unique subset of the group column, nor do I want unique subsets of items, nor even unique combinations of groups and items. I know these have been covered elsewhere (see "Unique() for more than one variable"). What I want are unique sets of items, where sets are defined by groups.

Here is an example

set.seed(1234)
library(data.table)
A <- data.table(group = rep(c("A","B","C","D","E","F"),each = 4), 
item =  c(1, 2, 4, 3, 5, 2, 3, 6, 10, 12, 1, 2, 1, 2, 4, 3, 6, 3,
 5, 2, 10, 12, 1, 2), c = runif(8))
A <- A[-23, ] #so we can have an example of unbalanced groups
> A
    group item          c
 1:     A    1 0.15904600
 2:     A    2 0.03999592
 3:     A    4 0.21879954
 4:     A    3 0.81059855
 5:     B    5 0.52569755
 6:     B    2 0.91465817
 7:     B    3 0.83134505
 8:     B    6 0.04577026
 9:     C   10 0.15904600
10:     C   12 0.03999592
11:     C    1 0.21879954
12:     C    2 0.81059855
13:     D    1 0.52569755
14:     D    2 0.91465817
15:     D    4 0.83134505
16:     D    3 0.04577026
17:     E    6 0.15904600
18:     E    3 0.03999592
19:     E    5 0.21879954
20:     E    2 0.81059855
21:     F   10 0.52569755
22:     F   12 0.91465817
23:     F    2 0.04577026

#The unique groups are A:F, and the unique items are 1:6,10,12.
#The unique sets of items are: (set1) 1,2,3,4; (set2) 5,2,3,6;
#(set3) 10,12,1,2; (set4) 10,12,2

I want to retrieve these unique sets of items (note again that the item sets are formed by groups). (The third column means little at this time. For fun, I include sums by each 'item'). The output table should look like this:

group item c 
A 1 0.68474355 #note that groups A and D share this same set of items (set1) 
A 2 0.95465409
A 4 1.05014459 # c sums groupAitem4$c with groupDitem4$c
A 3 0.85636881
B 5 0.74449709 # group E has the same items (set2), even if not the same order, c is totaled by item.
B 2 1.72525672
B 3 0.87134097
B 6 0.20481626
C 10 0.159046
C 12 0.03999592
C 1 0.21879954
C 2 0.81059855
F 10 0.52569755 #Not the same as group C
F 12 0.91465817
F 2 0.04577026

I suppose there might be a way of going through reshape that would be quite awkward. My data is large, so efficient procedures like data.table would be very appreciated.

Not fancy/efficient but `strsplit(unique(paste0(A$group,",",A$item)),",")` — Jessica B, Aug 27 '13 at 13:03
How do you want to handle the various `A$c` values that belong to each unique pairing? Take a look at `aggregate` and `plyr` for general ideas. — Carl Witthoft, Aug 27 '13 at 13:10
possible duplicate of [Unique() for more than one variable](http://stackoverflow.com/questions/7790732/unique-for-more-than-one-variable) and http://stackoverflow.com/questions/9944816/unique-on-a-dataframe-with-only-selected-columns?rq=1 and http://stackoverflow.com/questions/10873203/r-find-all-unique-values-among-subsets-of-a-data-frame?rq=1 — IRTFM, Aug 27 '13 at 13:29
@CarlWitthoft: I can keep the first of the A$c columns, but that is the least of my worries. — user2627717, Aug 27 '13 at 14:05
@JessicaB: the strsplit... code you suggest does not seem to work. It gives me all the rows of A, duplicates and all. Note that what I want to keep are the unique combinations of items. — user2627717, Aug 27 '13 at 14:09
@DWin: I hope with the added notes, and with the example which had always been there, you now understand that my question is very different from the item you referenced. The other questions requested unique groups, I want unique groups of items. — user2627717, Aug 27 '13 at 14:27
The more I think of it, the more I realize that the unique function may not be up to this... other ideas would be appreciated. — user2627717, Aug 27 '13 at 15:35
My answer shows you the 23 unique group/item combinations - I notice your example data.table 'A' has no duplicated group/item combinations so I clearly don't understand the question, sorry! — Jessica B, Aug 28 '13 at 10:51

Metrics · Accepted Answer · 2013-08-28T14:11:29.850

2

library(plyr)  
my<-ddply(A,.(group),summarize, mylist=list(item))

> my
  group       mylist
1     A   1, 2, 4, 3
2     B   5, 2, 3, 6
3     C 10, 12, 1, 2
4     D   1, 2, 4, 3
5     E   6, 3, 5, 2
6     F    10, 12, 2

# sort the items within each group so that order does not matter when matching;
# lapply avoids hard-coding the number of groups
my$mylist<-lapply(my$mylist, sort)

> my
  group       mylist
1     A   1, 2, 3, 4
2     B   2, 3, 5, 6
3     C 1, 2, 10, 12
4     D   1, 2, 3, 4
5     E   2, 3, 5, 6
6     F    2, 10, 12

myuni<-unique(my$mylist)

> myuni
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 5 6

[[3]]
[1]  1  2 10 12

[[4]]
[1]  2 10 12

finaloutput<-my[match(myuni,my$mylist),] # first group that carries each unique item set

> finaloutput
  group       mylist
1     A   1, 2, 3, 4
2     B   2, 3, 5, 6
3     C 1, 2, 10, 12
6     F    2, 10, 12

A[A$group %in% finaloutput$group,]
   group item           c
1      A    1 0.113703411
2      A    2 0.622299405
3      A    4 0.609274733
4      A    3 0.623379442
5      B    5 0.860915384
6      B    2 0.640310605
7      B    3 0.009495756
8      B    6 0.232550506
9      C   10 0.113703411
10     C   12 0.622299405
11     C    1 0.609274733
12     C    2 0.623379442
21     F   10 0.860915384
22     F   12 0.640310605
23     F    2 0.232550506
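
The steps above keep one representative group per item set but leave c as it is, whereas the expected table in the question totals c by item across the groups that share a set. What follows is only a minimal data.table sketch of that remaining step, not a tested solution for large data. It assumes that a pasted string of the sorted items is an acceptable key for a set, and that the first group showing a set should label it. The names set_key and rep_group are made up for illustration.

```r
library(data.table)

# Same example data as in the question (c is recycled from 8 random values)
set.seed(1234)
A <- data.table(group = rep(c("A","B","C","D","E","F"), each = 4),
                item  = c(1, 2, 4, 3, 5, 2, 3, 6, 10, 12, 1, 2,
                          1, 2, 4, 3, 6, 3, 5, 2, 10, 12, 1, 2),
                c     = runif(8))
A <- A[-23, ]  # unbalanced groups, as in the question

# 1. One key per group: its sorted items pasted together (order-insensitive)
A[, set_key := paste(sort(item), collapse = ","), by = group]

# 2. Representative group for each set: the first group that shows that key
A[, rep_group := group[1L], by = set_key]

# 3. Total c per item within each set, labelled by the representative group
res <- A[, .(c = sum(c)), by = .(rep_group, item)]
setnames(res, "rep_group", "group")
res
```

With the example data this should leave groups A, B, C and F, one row per item, with c totalled over A and D and over B and E. If the items could themselves contain commas, a different separator or a list column would be needed for the key.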

edited Aug 28 '13 at 14:11

answered Aug 27 '13 at 13:19

Metrics


Thanks for trying but same issue as Jessica's proposal above. This answer gives 8 rows not 4. Duplicates of the item sets are included. – user2627717 Aug 27 '13 at 14:13
It would be helpful to demonstrate your solutions with the reproducible code. – user2627717 Aug 27 '13 at 14:31
I am not sure what you are asking for. There is already the reproducible code. – Metrics Aug 27 '13 at 15:02
Please be kind enough to compare the outputs of your reproducible code with the output I want, as clearly described in the question. You will see that they do not match. Thanks for trying to help. – user2627717 Aug 27 '13 at 15:05
Not really. I explained that I do not want unique sets of items. If you could indulge me and try the example with: A <- data.frame(group = rep(1:12, each=4), item=sample(letters,6), c = runif(8)). The results should include all items in a group that is selected as containing a unique set. Yours would not. – user2627717 Aug 27 '13 at 15:28
I am sorry. You said that you want to match the output? – Metrics Aug 27 '13 at 15:30
I am sorry @Metrics. I will have to give up on trying to make you read and understand the question. It is not your garden-variety "unique" request. I would be grateful if you had taken the time to understand what I want. For now, I am convinced that is not the case. – user2627717 Aug 27 '13 at 15:39
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/36349/discussion-between-user2627717-and-metrics) – user2627717 Aug 27 '13 at 15:58
Sorry, I am not available. I suggest you edit the question and make it clearer with the expected output. Otherwise, it is likely to be a good candidate for closure. – Metrics Aug 27 '13 at 16:00
It's clearer now. I think you can treat each unique item set as a list and then compare them. But I am not sure whether your item sets can be identified as lists in the real data. – Metrics Aug 27 '13 at 18:30
The answer looks great. The only step missing is the conversion of the table back to the 'reshape-long' format. I am not sure that it is so straightforward with lists. Would you have a hint? Thanks! – user2627717 Aug 28 '13 at 00:12
I don't have the solution for that at this point. But, I will try. In the meantime, it will be good if you post that as a new question by making reference to this solution (and of course your question) – Metrics Aug 28 '13 at 00:54
Thanks, but there is no way to get to the table I posted in the question without that step. It seems to be a critical element of the answer I want. – user2627717 Aug 28 '13 at 13:02

score 0 · Answer 2 · answered Aug 27 '13 at 13:21

0

If you just need the combinations:

unique(dataset[, c("group", "item")])


Thierry


No dice. I tried to explain in the question that I want all unique sets of items, sets of items are defined by group. Thanks. – user2627717 Aug 27 '13 at 14:14
I think you mean to ask for unique 2-way combinations of items rather than "unique sets of items". Peoples' understanding of mathematical sets may be getting in the way of communication here. – IRTFM Aug 27 '13 at 14:52
@DWin. I think the example in the question shows that I do not want 2-way combinations. I thought some effort went into making sure that my question does not get mixed up that way... oh well. – user2627717 Aug 27 '13 at 15:32

score 0 · Answer 3 · answered Aug 27 '13 at 14:53

0

Since you don't use set.seed or dput, everyone trying to use your code will get a different result. This may give you what you want, although at the moment it is unclear whether the number of items in a group will always be small and whether only the 2-way combinations are desired:

unique(t(do.call(cbind, tapply(A$item, A$group, combn, 2) ) )  )

The combn function returns the unique combinations in a column format, so I needed to transpose before using unique, which operates on rows by default. If you can work with a column-oriented result, you can skip the transpose by using the MARGIN argument:

unique(do.call(cbind, tapply(A$item, A$group, combn, 2) )  , MARGIN=2)
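
To make the column layout concrete, here is a small base R sketch on made-up items (the vector items and the other variable names are only for illustration, not taken from the question's data). It shows that combn puts one combination in each column, and that transposing before unique and using MARGIN=2 lead to the same pairs in different orientations.

```r
# Items of one hypothetical group
items <- c(1, 2, 4)

# combn() returns one combination per column: a 2 x 3 matrix here
pairs <- combn(items, 2)
dim(pairs)

# Binding the same pairs twice mimics two groups with identical items
both <- cbind(pairs, combn(items, 2))   # 2 x 6, every pair appears twice

by_row <- unique(t(both))               # transpose, then keep unique rows
by_col <- unique(both, MARGIN = 2)      # keep unique columns directly

dim(by_row)   # one pair per row
dim(by_col)   # one pair per column
```

Note that unique treats c(1, 2) and c(2, 1) as different, so the items in each group would need to be sorted first if order should not matter.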

answered Aug 27 '13 at 14:53

IRTFM


If you are expecting to match with your output, you just need to use `A[unique(A$item),]`. See my answer – Metrics Aug 27 '13 at 15:17
@Metrics: Please see my response to your answer. – user2627717 Aug 27 '13 at 15:30
@Metrics: I hope the example helps to explain what I want. I understand the question is tricky. – user2627717 Aug 27 '13 at 17:05
