1

I need to randomly sample a subset of n elements from a list in Scala, and I was wondering if there was a convenient way to do so without recourse to manually checking that each of the n elements are unique. At the moment I have something like this:

import util.Random

def sample(itms:List[A], sampleSize:Int) {
  var numbersSeen = Set[Int]()
  var sampled = List[A]()
  val itmLen = itms.size()
  var sampleIdex = Random.nextInt(itmLen)
  while(sampled < sampleSize) {
    if(numbersSeen.contains(sampleIdex)){
      sampleIdex = Random.nextInt(itmLen)
    } else {
      numbersSeen.add(sampleIdex)
      sampled.add(itms(sampleIdex))
    }
  }
  sampled
}

I was hoping there was something more elegant that can be done to either generate a non-repeating random list of integers in a range or to randomly sample n elements from a list.

John Sullivan
  • 1,301
  • 1
  • 10
  • 20

5 Answers5

5

If your list is not too long you could shuffle a list of index numbers and then march through that list.

In Scala that would be something like:

val aList = ('A' to 'Z').toList

val aListIterator = scala.util.Random.shuffle((0 until aList.length).toList).toIterator

and then in your looping structure:

...
if( aListIterator.hasNext ) aList(aListIterator.next)
...

If your list is huge, a function that returns a unique random number in the range of your list size (used as an index) might be a better approach. Jeff Preshing, recently blogged about unique random numbers, http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers.

Keith Pinson
  • 1,725
  • 1
  • 14
  • 17
3

You can pick one randomly, and sample from the list except the one you've just picked, with simpleSize-1 (tail-)recursively:

    def sample[A](itms:List[A], sampleSize:Int) = {

        def collect(vect: Vector[A], sampleSize: Int, acc : List[A]) : List[A] = {
            if (sampleSize == 0) acc
            else {
                val index = Random.nextInt(vect.size)
                collect( vect.updated(index, vect(0)) tail, sampleSize - 1, vect(index) :: acc)
            }
        }

        collect(itms toVector, sampleSize, Nil)
    }                                 //> sample: [A](itms: List[A], sampleSize: Int)List[A]


    sample(1 to 10 toList, 5)         //> res0: List[Int] = List(6, 8, 2, 1, 10)
Keyel
  • 144
  • 4
  • This is `O(n*k)` where `n` is the length of the list and `k` is the number of items. Not a good choice unless `k < log(n`). – Rex Kerr Feb 14 '13 at 11:48
  • I think the collect part is only O(k), considering all the vector operations take effective constant times and collect is called k times (k=sampleSize). There is a one time O(n) operation (toVector) at first but its only adds O(n), not multiplies. – Keyel Feb 14 '13 at 16:39
  • sample(itms, sampleSize) takes a bit less than a sec on my comp with n = 1 000 000 and k = 100 000 – Keyel Feb 14 '13 at 18:08
  • You're right--I made a mistake about where the vector and list operations were. Sorry! – Rex Kerr Feb 14 '13 at 20:20
  • Np :) 12345678912 – Keyel Feb 14 '13 at 23:29
1
itms.map(x => (x, util.Random.nextDouble)).sortBy(_._2).take(sampleSize).map(_._1)

as long as you don't care about the inefficiency of sort.

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
0

You could take a random sample from the set of subsets, i.e.:

val distinctSubsets = itms.to[Set].subsets(sampleSize)

Then choose one of those randomly.

Alex DiCarlo
  • 4,851
  • 18
  • 34
-1

What about this approach?

trait RandomOrdering[T] extends Ordering[T]

object RandomOrdering {
  implicit def defaultOrdering[T] = new RandomOrdering[T] {
    def compare(x:T, y:T) = (Random nextInt 3) - 1
  }
}

def sample[A](items:List[A], sampleSize:Int)(implicit r:RandomOrdering[A]) =
  items.sorted take sampleSize

It might be less performant but it also allows you to inject a different RandomOrdering.

EECOLOR
  • 11,184
  • 3
  • 41
  • 75
  • 1
    An ordering that gives random answers is not guaranteed to produce a randomly sorted set of numbers. (In particular, it's a disaster for insertion sort, which almost all sorts default to for small numbers.) – Rex Kerr Feb 13 '13 at 21:23
  • I do not understand, what do you mean? Using OP's method I could in theory also get a list of the same ordering as the original. If I am missing something please enlighten me. – EECOLOR Feb 13 '13 at 21:28
  • 1
    Every element should be equally likely to be the first element in your list, regardless of how they're sorted or compared. This is not true at all for your solution. You will produce a rather scrambled order, but not random. – Rex Kerr Feb 13 '13 at 21:30