
This question is quite straightforward. However, the solutions that I have found to it are extremely memory and time inefficient. I am wondering if this can be done in R without grinding one's machine into dust.

Take a vector:

x<-c("A", "B", "B", "E", "C", "C", "D", "E", "A', "C")

This one has 10 elements, of which five are unique. Importantly, some elements are repeated, so any permutation must contain the same total number of each type of element. I wish to permute this sequence/vector 10,000 times, with each permutation randomly generated and unique. With my real data, I could be doing these permutations for vectors of up to 1000 elements, which is very hard to do efficiently.

To get one permutation, you can just do:

sample(x)

or, from the gtools package:

permute(x)

I could write some code to do that 10,000 times, but I am likely to get duplicates. Is there a way of doing this while dropping duplicates until 10,000 unique permutations are reached?
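
Roughly the brute-force approach I have in mind (a sketch only, and probably slow):

# Naive sketch: keep sampling and discard duplicates until 10,000
# unique permutations have been collected. Uses x as defined above.
perms <- list()
seen <- character(0)
while (length(perms) < 10000) {
  p <- sample(x)
  key <- paste(p, collapse = "")
  if (!(key %in% seen)) {
    seen <- c(seen, key)
    perms[[length(perms) + 1]] <- p
  }
}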

Other similar questions on Stack Overflow and Cross Validated have asked about generating all the unique permutations of a sequence. These questions are here:

Shuffling a vector - all possible outcomes of sample()?

Generating all distinct permutations of a list in R

https://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations

These are good, and the suggestions for generating all the unique permutations work well; it would certainly be easy to generate them all and then draw a random sample of 10,000. However, beyond about 10 elements in a vector this becomes very memory intensive, because the number of distinct permutations explodes combinatorially.
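
For scale, the number of distinct permutations of a multiset is a multinomial coefficient; for the 10-element example above it is already 75,600 (a quick check in R):

# distinct permutations = 10! / (2! * 2! * 3! * 1! * 2!)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
factorial(length(x)) / prod(factorial(table(x)))
# [1] 75600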

Any comments about how to do this efficiently for up to 1000 elements appreciated. This has me getting very dizzy.

– jalapic
  • Do you want 10000 random unique permutations or would something that gives you 10000 unique (not necessarily random) permutations work? – Dason Jul 07 '14 at 02:12
  • I should clarify - the permutations should definitely be random. – jalapic Jul 07 '14 at 02:13
  • Why not overshoot a little, or write a function with a `while` condition and the `duplicated` function? I tried the following and it worked OK (reasonable time, and computer has not turned to dust) but I'm not sure if that's what you're looking for: `set.seed(1); N <- t(replicate(15000, sample(x))); sum(duplicated(N)); out <- N[!(duplicated(N)), ][1:10000, ]`. – A5C1D2H2I1M1N2O1R2T1 Jul 07 '14 at 04:18
  • Can you clarify two things: 1) You mentioned in the comments that the permutations should be random. Does this mean you are willing to accept duplicates? 2) Do you want to store all 10,000 permutations, or are you okay with sampling one permutation, using it, discarding it, and then sampling the next permutation? Because the latter would entail much less memory. – ved Jul 07 '14 at 04:49
  • @AnandaMahto - I don't have time tonight, will check in the am, but this does appear to possibly work. – jalapic Jul 07 '14 at 04:56
  • @ved My plan is to apply a function to each permuted sequence. I need to store the result of this function in a df. I don't need to store the generated permuted sequence itself for anything else, so it could be discarded. I can't have any duplicates. (See the sketch after these comments.) – jalapic Jul 07 '14 at 04:58
  • @AnandaMahto - I tested this exact script a few times and it seems to work well. Do you want to post it as an answer so I can accept? – jalapic Jul 07 '14 at 14:39
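
A sketch of the workflow described in that comment, with a hypothetical stat() function standing in for the real per-permutation statistic:

# Hypothetical workflow: compute a statistic on each unique permutation,
# store only the results, and discard the permutations themselves.
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
stat <- function(p) sum(p == "A")            # stand-in for the real function
N <- unique(t(replicate(15000, sample(x))))  # overshoot, then deduplicate
N <- N[1:10000, ]                            # assumes at least 10,000 remain
results <- data.frame(value = apply(N, 1, stat))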

3 Answers

2

I don't think the computations need to be as expensive as you are making them out to be. For small "x" vectors, you might want to overshoot a little (here, I've somewhat overdone it), then check for duplicates using duplicated. If dropping the duplicated rows leaves you short of your desired 10,000, repeat the process to fill the difference, using rbind to add the new rows you want to keep to the matrix you get from replicate. This could be implemented in a while loop (see the sketch after the example below).

x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
set.seed(1)
N <- t(replicate(15000, sample(x)))
sum(duplicated(N))
# [1] 1389
out <- N[!(duplicated(N)), ][1:10000, ]
head(out)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "B"  "E"  "C"  "D"  "B"  "E"  "A"  "C"  "C"  "A"  
# [2,] "B"  "B"  "C"  "C"  "C"  "D"  "E"  "E"  "A"  "A"  
# [3,] "C"  "B"  "C"  "A"  "A"  "E"  "D"  "C"  "B"  "E"  
# [4,] "C"  "C"  "E"  "B"  "C"  "E"  "A"  "A"  "D"  "B"  
# [5,] "A"  "C"  "D"  "E"  "E"  "C"  "A"  "B"  "B"  "C"  
# [6,] "C"  "E"  "E"  "B"  "A"  "C"  "D"  "A"  "B"  "C"

The duplicated step is actually the most expensive, as far as I can see:

y <- sample(500, 1000, TRUE)
system.time(N <- t(replicate(12000, sample(y))))
# user  system elapsed 
# 2.35    0.08    2.43 
system.time(D <- sum(duplicated(N)))
#  user  system elapsed 
# 14.82    0.01   14.84 
D
# [1] 0

There, we have no duplicates among our 12,000 samples.

– A5C1D2H2I1M1N2O1R2T1
  • Thank you. That's really interesting that it is the duplicated step that is so expensive. For my data, I have used another workaround: with your method of efficiently generating the array of permutations, I then just do N <- unique(N); N[sample(nrow(N), 10000), ] to get 10,000 unique permutations. – jalapic Jul 07 '14 at 19:41
0

If you are only interested in the first 10,000 permutations (in dictionary order), you can make use of the iterpc library.

library(iterpc)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
I <- iterpc(table(x), ordered=TRUE)
# first 10000 permutations
result <- getnext(I, d=10000)

And getting them is very fast:

> system.time(getnext(I, d=10000))
   user  system elapsed 
  0.004   0.000   0.005 
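
Since the question asks for random permutations, one option (my addition, not part of the original answer) is to shuffle the rows of the returned matrix once; note that the set itself is still the first 10,000 in dictionary order, only their ordering is randomized:

# result is a matrix with one permutation per row; randomize the row order
result <- result[sample(nrow(result)), ]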
– Randy Lai
-1

Here's an idea. This is not necessarily an answer but it's too big for a comment.

Get the permutations in an orderly way, and add them to a collection. For example, if elements are A, B, C, and D:

A B C D
A B D C
A D B C
... so on

And once you have the required number of permutations (10,000 in your case), permute that collection once.

If the cost of randomization is the bottleneck, this approach should solve it.
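
A sketch of this idea in R, using the standard next-lexicographic-permutation algorithm (next_perm() is a hypothetical helper, not part of the answer):

# Generate the first 10,000 permutations in lexicographic order, then
# shuffle the collection once. Growing with rbind is for clarity only;
# preallocating would be faster.
next_perm <- function(v) {
  n <- length(v)
  i <- n - 1
  while (i >= 1 && v[i] >= v[i + 1]) i <- i - 1
  if (i < 1) return(NULL)                  # v was the last permutation
  j <- n
  while (v[j] <= v[i]) j <- j - 1
  tmp <- v[i]; v[i] <- v[j]; v[j] <- tmp
  v[(i + 1):n] <- rev(v[(i + 1):n])
  v
}

v <- sort(c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C"))
perms <- matrix(v, nrow = 1)
while (nrow(perms) < 10000) {
  v <- next_perm(v)
  if (is.null(v)) break                    # fewer than 10,000 exist
  perms <- rbind(perms, v)
}
perms <- perms[sample(nrow(perms)), ]      # permute the collection once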

– sampathsris