
I am debugging a short program that repeatedly samples from the elements of a vector, and I get a disconcerting result towards the end of the sampling. It happens once the remaining elements of the vector have been drawn down to a single value.

In the specific case I'm referring to, the vector is called remaining and contains a single element, the number 2. I would expect any sample of size 1 from this vector to stubbornly return 2, since 2 is its only element, but this is not the case:

Browse[2]> is.vector(remaining)
[1] TRUE
Browse[2]> sample(remaining,1)
[1] 2
Browse[2]> sample(remaining,1)
[1] 2
Browse[2]> sample(remaining,1)
[1] 1
Browse[2]> sample(x=remaining, size=1)
[1] 1
Browse[2]> sample(x=remaining, size=1)
[1] 2
Browse[2]> sample(x=remaining, size=1)
[1] 1
Browse[2]> sample(x=remaining, size=1)
[1] 1
Browse[2]> sample(x=remaining, size=1)
[1] 1

As you can see, the return value is sometimes 1 and sometimes 2.

What am I misunderstanding about the function sample()?

Antoni Parellada

1 Answer


From help("sample"):

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x.

So, when remaining is 2, sample(remaining, 1) is equivalent to sample(x = 1:2, size = 1), which can return either 1 or 2.
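A minimal sketch of the documented behavior, assuming nothing beyond base R. The seed and the names `draws` and `draws2` are only illustrative. It shows that a length-one numeric is treated as the upper bound of `1:x`, while a vector of length two or more is sampled as given:

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

remaining <- 2

# A length-one numeric x >= 1 is interpreted as 1:x,
# so these draws come from c(1, 2), not from c(2).
draws <- replicate(1000, sample(remaining, 1))
table(draws)   # counts for both 1 and 2

# A vector of length two or more is sampled from as given.
draws2 <- replicate(1000, sample(c(2, 2), 1))
unique(draws2) # only 2
```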

Update

From the comments it's clear you are also looking for a way around this behavior. Here is a benchmark comparison of three mentioned alternatives:

library(microbenchmark)

# if remaining is of length one
remaining <- 2

microbenchmark(a = {if ( length(remaining) > 1 ) { sample(remaining) } else { remaining }},
               b = ifelse(length(remaining) > 1, sample(remaining), remaining),
               c = remaining[sample(length(remaining))])

Unit: nanoseconds
 expr  min   lq    mean median     uq   max neval cld
    a  349  489  625.12  628.0  663.5  3283   100 a  
    b 1536 1886 2240.58 2025.0 2165.5 13898   100  b 
    c 4051 4400 5193.41 4679.5 5064.0 38413   100   c

# If remaining is not of length one
remaining <- 1:10
microbenchmark(a = {if ( length(remaining) > 1 ) { sample(remaining) } else { remaining }},
               b = ifelse(length(remaining) > 1, sample(remaining), remaining),
               c = remaining[sample(length(remaining))])

Unit: microseconds
 expr    min      lq     mean median      uq    max neval cld
    a  5.238  5.7970  6.82703  6.251  6.9145 51.264   100  a 
    b 11.663 12.2920 13.14831 12.851 13.3745 34.851   100   b
    c  5.238  5.9715  6.57140  6.426  6.8450 14.667   100  a 

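It looks like the suggestion from joran may be the fastest in your case if sample() is called much more often when remaining has length > 1, although there its timings are nearly indistinguishable from those of the if () {} else {} approach. When remaining has length one, the if () {} else {} approach is clearly the fastest.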
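One caveat when comparing the three options: they do not all return the same thing. The sketch below (variable names `a`, `b`, `c_idx`, `a1` and `c1` are just for illustration) suggests that `ifelse()` returns a value shaped like its test, so with a length-one test it keeps only the first element of the shuffled vector, whereas the other two return a full permutation:

```r
set.seed(1)  # arbitrary seed

remaining <- 1:10

a     <- if (length(remaining) > 1) sample(remaining) else remaining
b     <- ifelse(length(remaining) > 1, sample(remaining), remaining)
c_idx <- remaining[sample(length(remaining))]

length(a)      # 10: a full permutation
length(b)      # 1: ifelse() result has the shape of its length-one test
length(c_idx)  # 10: a full permutation

# With a single element, both a and c leave it unchanged.
remaining <- 2
a1 <- if (length(remaining) > 1) sample(remaining) else remaining
c1 <- remaining[sample(length(remaining))]
```

If only one draw is needed, as in the question, passing `size = 1` to `sample()` in each option makes the three comparable again.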

duckmayr
  • I read that, and I wasn't sure what it meant, or how to work around it. – Antoni Parellada Jan 29 '18 at 16:42
  • You should check if remaining is length 1, and avoid sampling in that case. – alan ocallaghan Jan 29 '18 at 16:43
  • @Toni This behavior is because sampling from a vector of length one is not meaningful; there is only one possible answer, which is the vector itself, not even some value derived from the vector. – duckmayr Jan 29 '18 at 16:44
  • @aocall That means an additional line or two in the code. There ought to be a way to work around this really messed up behaviour of the sample() function... – Antoni Parellada Jan 29 '18 at 16:44
  • `x[sample(length(x),1)]` – James Jan 29 '18 at 16:44
  • `remaining <- c( 2, 2 )` – vaettchen Jan 29 '18 at 16:47
  • @Toni This is a very old "gotcha" that originates in R (or S, probably) trying to be _helpful_ for interactive work, allowing a shorter specification for the common case `sample(1:n)` as just `sample(n)`. Most people agree in hindsight it was ill-advised, but the behavior is so old now, and so much relies on that behavior that we're stuck with it. – joran Jan 29 '18 at 16:49
  • @joran Would you advise placing `sample` within an `ifelse()` statement then? The issue is that `remaining` starts as a much longer vector within the loop, and it's only when it's whittled down to 1 element that things start to go awry... – Antoni Parellada Jan 29 '18 at 16:51
  • @Toni Maybe something along the lines of `if ( length(remaining) > 1 ) { sample(remaining) } else { remaining }`, since you won't need the vectorization and slowdown of `ifelse()` – duckmayr Jan 29 '18 at 16:53
  • @Toni I would recommend something like this: https://stackoverflow.com/a/13990144/324364 – joran Jan 29 '18 at 16:54
  • `resample <- function(x, ...) x[sample.int(length(x), ...)]` - see https://github.com/HenrikBengtsson/Wishlist-for-R/issues/19 – HenrikB Feb 01 '18 at 08:05
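The `resample()` helper from the last comment samples positions rather than values, so a length-one vector is never reinterpreted as `1:x`. Below is a rough sketch of how it could be used in a loop that draws a vector down to nothing. The vector contents, the seed, and the names `only_two` and `picked` are assumptions for illustration, not taken from the original program:

```r
# Helper as given in the comment: sample indices, then subset.
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(123)  # arbitrary seed

# A single remaining element is always returned as is.
only_two <- replicate(1000, resample(2, 1))
unique(only_two)  # 2

# Hypothetical draw-down loop: pick one element at a time by index.
remaining <- c(5, 9, 2)
picked <- numeric(0)
while (length(remaining) > 0) {
  i <- sample.int(length(remaining), 1)  # a position, not a value
  picked <- c(picked, remaining[i])
  remaining <- remaining[-i]
}
picked  # the three values in some random order
```

Sampling an index also makes it easy to remove exactly the element that was drawn, even when the vector contains duplicate values.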