4

I am trying to find all of the unique groupings of a vector/list of items, length 39. Below is the code I have:

x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
       "TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
       "TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
       "TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
       "TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
       "TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
       "TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
       "TmaxKLEX","TminKSDF","TmaxKSDF")

# Generate a list with the combinations  
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z)))) 

However, the code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. thanks.

smci
  • 32,567
  • 20
  • 113
  • 146
HazelnutCoffee
  • 175
  • 1
  • 1
  • 9
  • *ALL* groupings (as in all subsets) or just all pairs? – Iterator Aug 05 '11 at 17:19
  • 7
    Um....`sum(choose(39,1:39))` = 549755813887. Think about how big that number is for a second, and now think about whether you really want to do this. – joran Aug 05 '11 at 17:22
  • 4
    @joran: When you say think how big it is ... you mean like: "549755813887 or ~549 Billion, is 1/25 the US Debt" ? ;-) – Kyle Brandt Aug 05 '11 at 17:27
  • 1
    @RyanB, I realize you think you want all combinations... but I suspect you don't really need all combinations. Is this a maximization problem of some type? If it is, you might ask a different question about how to solve a certain type of goal seek rather than brute forcing every possible combination. – JD Long Aug 05 '11 at 17:47
  • @Kyle I was thinking more along the lines of "~549 Billion is almost 4 times the avg dist from the sun to the earth in *meters*." But yours works too! ;) – joran Aug 05 '11 at 18:05
  • 549B? Just use `mmap`. Or Hadoop. :) – Iterator Aug 05 '11 at 18:13
  • 2
    @joran: Some people use `sum(choose())` and some just calculate `2^39 - 1`. :) – Iterator Aug 05 '11 at 18:15
  • How is this different than your previous question, [R Question Number of Unique Combinations of A,A,A,A,B,B,B,B,B](http://stackoverflow.com/questions/6164620/r-question-number-of-unique-combinations-of-a-a-a-a-b-b-b-b-b)? – Joshua Ulrich Aug 05 '11 at 19:40
  • the size involved. hence the question "Is there a better way to do this?" – HazelnutCoffee Aug 05 '11 at 20:09
  • The size is N=39. You want the power-set, i.e. all 2^39 -1 possible combinations of all possible lengths 1..N, not just combinations two-at-a-time. Tagged [[tag:power-set]] – smci May 20 '18 at 08:29
  • It helps if you show us why you want these, i.e. show us the rest of the pipeline after this. Representing them as bitfields of length 39 probably beats manipulating arbitrary-length lists/vectors of strings. If you're considering set membership to do clustering, heatmapping etc. there are well-known libraries for all of those, written in C++. This is almost surely an XY problem. Tell us your real need. – smci May 20 '18 at 08:46

2 Answers2

3

To calculate all unique subsets, you are simply creating all binary vectors with the same length as the cardinality of the original set of items. If there are 39 items, then you are looking at all binary vectors of length 39. Each element of each vector identifies, yes or no, whether or not the item is in the corresponding subset.

As there are 39 items, and each can either be in or not-in a given subset, then there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.

That is, as @joran said, about 549B vectors. Given that the binary vectors are most compactly representing the data (i.e. without strings), then you will need 549B * 39 bits to return all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes. If you insist on using the characters, you're likely to be in the many tens of terabytes.

It's certainly feasible to buy a system that can support this, but not very cost-effective.

At a meta-level, it is very likely, as @JD said, that this is not the path you really need to go. I recommend posting a new question and maybe it can be refined here or on the statistics-related SE site.

Iterator
  • 20,250
  • 12
  • 75
  • 111
  • FWIW, even if you don't store it, transmitting the data would take some time. Any decent computer should be fine generating all of the 39 bit vectors very rapidly. See "brute force password cracker" for illustrative purposes. :) DES was a 56-bit key, so a bit longer, and it required a few more computations per key. – Iterator Aug 05 '11 at 19:36
0

You might try using expand.grid.

Create a data frame from all combinations of the supplied vectors or factors. See the description of the return value for precise details of the way this is done.

Kyle Brandt
  • 26,938
  • 37
  • 124
  • 165
  • 1
    That ain't going to be pretty - I hope the OP and plenty of time on their hands, a huge stack of RAM, and plenty patience to weed through the resulting combinations... – Gavin Simpson Aug 05 '11 at 17:37
  • @Gavin: Agreed, just going to keep this here in case it helps a future person with a smaller factor set. – Kyle Brandt Aug 05 '11 at 18:12
  • 2
    i am looking for all unique subsets. is that really in the billions? – HazelnutCoffee Aug 05 '11 at 18:43
  • @RyanB: So this all falls under basic "combinatorics". A few basic things that come up are: "Do I care about the order" Do I want to know all combinations that include all items, or all combinations of 5 from a set of 20? These sort of things *greatly* change the size of the outcome. It will probably take your an hour to get the basics down with a probability for dummies sort of book and would be worth your time. – Kyle Brandt Aug 05 '11 at 19:13
  • 1
    Are you the same Kyle Brandt that is taking down SO (http://blog.serverfault.com/post/stack-exchange-maintenance-this-saturday/)? Please tell us you're not about to run this on SO servers. ;-) – Iterator Aug 05 '11 at 19:31