15

I want to permute a vector so that an element can't be in the same place after permutation, as it was in the original. Let's say I have a list of elements like this: AABBCCADEF

A valid shuffle would be: BBAADEFCCA

But these would be invalid: BAACFEDCAB or BCABFEDCAB

The closest answer I could find was this: python shuffle such that position will never repeat. But that's not quite what I want, because there are no repeated elements in that example.

I want a fast algorithm that generalizes that answer in the case of repetitions.

MWE:

library(microbenchmark)

set.seed(1)
x <- sample(letters, size=295, replace=T)

terrible_implementation <- function(x) {
  xnew <- sample(x)
  while(any(x == xnew)) {
    xnew <- sample(x)
  }
  return(xnew)
}

microbenchmark(terrible_implementation(x), times=10)


Unit: milliseconds
                       expr      min       lq    mean  median       uq      max neval
 terrible_implementation(x) 479.5338 2346.002 4738.49 2993.29 4858.254 17005.05    10

Also, how do I determine if a sequence can be permuted in such a way?

EDIT: To make it perfectly clear what I want, the new vector should satisfy the following conditions:

1) all(table(newx) == table(x)) 2) all(x != newx)

E.g.:

newx <- terrible_implementation(x)
all(table(newx) == table(x))
[1] TRUE
all(x != newx)
[1] TRUE
thc
  • 9,527
  • 1
  • 24
  • 39
  • A vague guess at deciding if the sequence is shuffle-able like this is that the most common element has to have at most `N / 2` repeats - the sequence definitely becomes unshuffleable beyond that, not sure if there are other ways for unshuffleability to occur. – Marius Nov 09 '17 at 00:44
  • What's wrong with your implementation? – Hugh Nov 09 '17 at 02:20
  • I thought I could be clever and only subscramble the elements that did not satisfy the condition. Of course, this won't work if there are more pigeons than pigeonholes. – Hugh Nov 09 '17 at 02:48
  • 1
    @Hugh it's way too slow. My real data is more like a vector of 1 million, and 1000 unique elements. – thc Nov 09 '17 at 04:09
  • Does the outcome need to have any degree of randomness? Would a Caesar cipher be adequate? – Hugh Nov 09 '17 at 04:11
  • Would appreciate if randomness could be implemented, but I guess it's not essential. I'm not sure how a Caesar cipher would be applicable here? – thc Nov 09 '17 at 04:18
  • So replace every A with B, every B with C, ... every Z with A. Or in general every unique element with the 'next' unique element for some ordering. It would be quick (microseconds on your example) but obviously deterministic. – Hugh Nov 09 '17 at 04:33
  • 1
    That would change the number of each letter. E.g., if my original was ABCC, BCAA wouldn't be valid. – thc Nov 09 '17 at 04:49

3 Answers3

4
#DATA
set.seed(1)
x <- sample(letters, size=295, replace=T)

foo = function(S){
    if(max(table(S)) > length(S)/2){
        stop("NOT POSSIBLE")
    }
    U = unique(S)
    done_chrs = character(0)
    inds = integer(0)
    ans = character(0)
    while(!identical(sort(done_chrs), sort(U))){
        my_chrs = U[!U %in% done_chrs]
        next_chr = my_chrs[which.min(sapply(my_chrs, function(x) length(setdiff(which(!S %in% x), inds))))]
        x_inds = which(S %in% next_chr)
        candidates = setdiff(seq_along(S), union(x_inds, inds))
        if (length(candidates) == 1){
            new_inds = candidates
        }else{
            new_inds = sample(candidates, length(x_inds))
        }
        inds = c(inds, new_inds)
        ans[new_inds] = next_chr
        done_chrs = c(done_chrs, next_chr)
    }
    return(ans)
}

ans_foo = foo(x)

identical(sort(ans_foo), sort(x)) & !any(ans_foo == x)
#[1] TRUE

library(microbenchmark)
microbenchmark(foo(x))
#Unit: milliseconds
#   expr      min       lq     mean   median       uq      max neval
# foo(x) 19.49833 22.32517 25.65675 24.85059 27.96838 48.61194   100
d.b
  • 32,245
  • 6
  • 36
  • 77
2

I think this satisfies all your conditions. The idea is to order by the frequency, start with the most common element and shift the value to the next value in the frequency table by the number of times the most common element appears. This will guarantee all elements will be missed.

I've written in data.table, as it helped me during debugging, without losing too much performance. It's a modest improvement performance-wise.

library(data.table)
library(magrittr)
library(microbenchmark)


permute_avoid_same_position <- function(y) {
  DT <- data.table(orig = y)
  DT[, orig_order := .I]

  count_by_letter <- 
    DT[, .N, keyby = orig] %>%
    .[order(N)] %>%
    .[, stable_order := .I] %>%
    .[order(-stable_order)] %>%
    .[]

  out <- copy(DT)[count_by_letter, .(orig, orig_order, N), on = "orig"]
  # Dummy element
  out[, new := first(y)]
  origs <- out[["orig"]]
  nrow_out <- nrow(out)
  maxN <- count_by_letter[["N"]][1]

  out[seq_len(nrow_out) > maxN, new := head(origs, nrow_out - maxN)]
  out[seq_len(nrow_out) <= maxN, new := tail(origs, maxN)]

  DT[out, j = .(orig_order, orig, new), on = "orig_order"] %>%
    .[order(orig_order)] %>%
    .[["new"]]
}

set.seed(1)
x <- sample(letters, size=295, replace=T)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))
microbenchmark(permute_avoid_same_position(x), times = 5)

# Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max
# permute_avoid_same_position(x) 5.650378 5.771753 5.875116 5.788618 5.938604 6.226228

x <- sample(1:1000, replace = TRUE, size = 1e6)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))

microbenchmark(permute_avoid_same_position(x), times = 5)
# Unit: milliseconds
#                           expr      min       lq    mean   median       uq      max
# permute_avoid_same_position(x) 239.7744 385.4686 401.521 438.2999 440.9746 503.0875
Hugh
  • 15,521
  • 12
  • 57
  • 100
  • Thanks! This works well. The solution actually seems kind of obvious after you describe the algorithm, lol. – thc Nov 09 '17 at 20:52
  • Great: FYI - I removed the performance bottleneck since you accepted. – Hugh Nov 09 '17 at 23:57
-2

We could extract substrings by the boundary of the repeating elements, sample and replicate

library(stringr)
sapply(replicate(10, sample(str_extract_all(str1, "([[:alpha:]])\\1*")[[1]]),
                simplify = FALSE), paste, collapse="")
#[1] "BBAAEFDCCA" "AAAFBBEDCC" "BBAAAEFCCD" "DFACCBBAAE" "AAFCCBBEAD" 
#[6] "DAAAECCBBF" "AAFCCDBBEA" "CCEFADBBAA" "BBAAEADCCF" "AACCBBDFAE"

data

str1 <- "AABBCCADEF"
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Please have a look at the MWE. Your solution didn't work on that. – thc Nov 09 '17 at 04:56
  • @thc It is based on your description `A valid shuffle would be: BBAADEFCCA But these would be invalid: BAACFEDCAB or BCABFEDCAB` – akrun Nov 09 '17 at 04:56
  • I understand, but it doesn't work even if I increased replicate from 10 to 1000 on the MWE (not the BBAADEFCCA string). – thc Nov 09 '17 at 04:58
  • @thc If you are looking for unique, then use `!duplicated` or `unique` – akrun Nov 09 '17 at 04:58
  • I think I didn't explain my question well... I want the length and content of the new vector to be the same. So e.g. `table(newx) == table(x)` I will get the same results. But also `all(x != newx)`. I'll try to update the description. – thc Nov 09 '17 at 05:07
  • @thc If you look at your example and my example., my example is a string, while yours is a vector (though your description looks like a string). If you want to get the table, then you have to split the string in my example – akrun Nov 09 '17 at 05:30
  • Please look at the code in the post for an example. – thc Nov 09 '17 at 05:32
  • @thc If `res` is my output, then `m1 <- sapply(strsplit(res, ""), table);all(rep(table(strsplit(str1, "")[[1]]), ncol(m1)) == m1)# [1] TRUE`. You have a vector and I started with a string. – akrun Nov 09 '17 at 05:34
  • As I mentioned, this doesn't work on the example I posted: `str1 <- sample(letters, size=295, replace=T) %>% paste0(collapse="")`. The second condition: `any(sapply(strsplit(res,""), function(r) all(r == strsplit(str1,""))))# [1] FALSE` – thc Nov 09 '17 at 05:59
  • @thc My conditional statement is different – akrun Nov 09 '17 at 06:04
  • 4
    Yes, it is different than the conditions required by the question. That's why it doesn't work on the example I posted. – thc Nov 09 '17 at 06:06