13

I'm attempting to retrieve n unique random elements for further processing from a Collection using the Streams API in Java 8, however, without much or any luck.

More precisely I'd want something like this:

Set<Integer> subList = new HashSet<>();
Queue<Integer> collection = new PriorityQueue<>();
collection.addAll(Arrays.asList(1,2,3,4,5,6,7,8,9));
Random random = new Random();
int n = 4;
while (subList.size() < n) {
  subList.add(collection.get(random.nextInt()));
}
sublist.forEach(v -> v.doSomethingFancy());

I want to do it as efficiently as possible.

Can this be done?

edit: My second attempt -- although not exactly what I was aiming for:

List<Integer> sublist = new ArrayList<>(collection);
Collections.shuffle(sublist);
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());

edit: Third attempt (inspired by Holger), which will remove a lot of the overhead of shuffle if coll.size() is huge and n is small:

int n = // unique element count
List<Integer> sublist = new ArrayList<>(collection);   
Random r = new Random();
for(int i = 0; i < n; i++)
    Collections.swap(sublist, i, i + r.nextInt(source.size() - i));
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
Community
  • 1
  • 1
habitats
  • 2,203
  • 2
  • 23
  • 31
  • 2
    Why streams? If you want efficiency, use `Collections.shuffle()` and grab a sublist – fge Feb 21 '15 at 22:10
  • 2
    The Collection I'm actually trying to do this on is a Queue, thus shuffle won't work unless I first convert it to a list. Also, I want to learn more about the Stream API:) – habitats Feb 21 '15 at 22:12
  • 1
    Aside: if you want a list containing sequential integers, you can use something like `IntStream.range(0, last).boxed().collect(toList());` – Stuart Marks Feb 22 '15 at 06:14
  • are there ever equal objects in the collection/queue? e.g.(`queue.addAll(Arrays.asList(1,2,3,2,1));`) if yes - does "unique" mean "not the same" or "not equal to already selected"? – harshtuna Feb 26 '15 at 00:46
  • Yes, a collection can har duplicates. Sets can not. However, distinct in this context means not picking the same index in the collection twice. – habitats Feb 26 '15 at 06:39

8 Answers8

14

The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here's a generified version of the shuffling approach:

static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    List<E> list = new ArrayList<>(coll);
    Collections.shuffle(list);
    return list.subList(0, n);
}

I'll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.

The shuffling approach has a couple disadvantages. It needs to copy out all the elements, and then it needs to shuffle all the elements. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.

An approach suggested by the OP and by a couple other answers is to choose elements at random, while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, this slows down quite a bit because of the likelihood of choosing duplicates rises as well.

Wouldn't it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.

Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth's notation, suppose you have N elements and you want to choose n of them at random. The next element should be chosen with probability

(n - m) / (N - t)

where t is the number of elements visited so far, and m is the number of elements chosen so far.

It's not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.

Given this algorithm, it's pretty straightforward to implement it in "conventional" Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here's a shot at that.

Algorithm S doesn't lend itself obviously to Java stream operations. It's described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I've been wrong about that before. I'll just say that it's not immediately obvious how to make this algorithm run in parallel.

There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated -- yes, mutated -- based on this random result. This seems hard to run in parallel, but at least it's easy to make thread-safe in case it's run from a parallel stream: just make it synchronized. It'll degrade to running sequentially if the stream is parallel, though.

The implementation is pretty straightforward. Knuth's description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval. Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voila:

/**
 * A stateful predicate that, given a total number
 * of items and the number to choose, will return 'true'
 * the chosen number of times distributed randomly
 * across the total number of calls to its test() method.
 */
static class Selector implements Predicate<Object> {
    int total;  // total number items remaining
    int remain; // number of items remaining to select
    Random random = new Random();

    Selector(int total, int remain) {
        this.total = total;
        this.remain = remain;
    }

    @Override
    public synchronized boolean test(Object o) {
        assert total > 0;
        if (random.nextInt(total--) < remain) {
            remain--;
            return true;
        } else {
            return false;
        }
    }
}

Now that we have our predicate, it's easy to use in a stream:

static <E> List<E> randomSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    return coll.stream()
        .filter(new Selector(coll.size(), n))
        .collect(toList());
}

An alternative also mentioned in the same section of Knuth suggests choosing an element at random with a constant probability of n / N. This is useful if you don't need to choose exactly n elements. It'll choose n elements on average, but of course there will be some variation. If this is acceptable, the stateful predicate becomes much simpler. Instead of writing a whole class, we can simply create the random state and capture it from a local variable:

/**
 * Returns a predicate that evaluates to true with a probability
 * of toChoose/total.
 */
static Predicate<Object> randomPredicate(int total, int toChoose) {
    Random random = new Random();
    return obj -> random.nextInt(total) < toChoose;
}

To use this, replace the filter line in the stream pipeline above with

        .filter(randomPredicate(coll.size(), n))

Finally, for comparison purposes, here's an implementation of the selection algorithm written using conventional Java, that is, using a for-loop and adding to a collection:

static <E> List<E> conventionalSelectN(Collection<? extends E> coll, int remain) {
    assert remain <= coll.size();
    int total = coll.size();
    List<E> result = new ArrayList<>(remain);
    Random random = new Random();

    for (E e : coll) {
        if (random.nextInt(total--) < remain) {
            remain--;
            result.add(e);
        }
    }            

    return result;
}

This is quite straightforward, and there's nothing really wrong with this. It's simpler and more self-contained than the stream approach. Still, the streams approach illustrates some interesting techniques that might be useful in other contexts.


Reference:

Knuth, Donald E. The Art of Computer Programming: Volume 2, Seminumerical Algorithms, 2nd edition. Copyright 1981, 1969 Addison-Wesley.

Community
  • 1
  • 1
Stuart Marks
  • 127,867
  • 37
  • 205
  • 259
  • 5
    Excellent answer, as always! +1 :) – Alexis C. Feb 22 '15 at 08:18
  • Well, what can I say. Very interesting answer, although it feels a little like we're killing sparrows with a rail gun -- but sometimes that's the solution we need. I've never experimented with a custom Selector before. Fancy stuff! I'll leave the question as answered, and might post some performance comparison later. – habitats Feb 22 '15 at 10:22
  • 1
    I personally don't feel relying on a mutable predicate is a good idea. Can't the same idea be done with an immutable one (I'm sure in Scala that wouldn't be that much more difficult, but don't know about Java 8). – herman Feb 22 '15 at 12:55
  • 1
    @habitats Yeah the streams approach is quite possibly overkill. I added an example of the conventional approach for comparison. – Stuart Marks Feb 22 '15 at 17:35
  • 1
    @herman By "immutable" predicate, do you mean "stateless" or "pure-functional"? If so I'm not sure how it could be done, regardless of language. The exact selection algorithm requires the probability of selection to be different each time it's called, and *this* decision affects the probability of the *next* decision. To me, that's a clear call for mutable state, though I'd be interested to learn if there's a way to avoid this. Finally, I added an "inexact" variation that doesn't use counters, instead using a fixed probability. It still has random state though. – Stuart Marks Feb 22 '15 at 17:39
  • Just to add; thanks for taking the time to extend your answer further! It's really neat to see all the different approaches to this problem. – habitats Feb 23 '15 at 02:28
  • 1
    @habitats Glad you appreciate it. Even though Java 8 has been out for almost a year, it's still new enough that there's benefit in exploring a variety of different techniques. – Stuart Marks Feb 24 '15 at 02:26
  • 1
    Hi Stuart! Came here from [another question](https://stackoverflow.com/questions/49154257/java-collections-shuffle-weird-behaviour), where you correct me in a comment by linking to this answer. Btw, it's an amazing answer, thank you very much for taking time to share your knowledge. But I have a question... Couldn't you short-circuit, both the stream-based solution (with `takeWhile` instead of `filter`) and the imperative one (by adding an `if` with `break`) once `n` elements have been selected? – fps Mar 08 '18 at 14:40
  • 1
    @FedericoPeraltaSchaffner Well `takeWhile` hadn't been added at this point, but to tell you the truth, I wasn't thinking at all of short-circuiting. Yes, if the nth element has been selected, it's possible to terminate the loop or stream at that point. My hunch is that the tail of the input that can be cut off is usually going to be short, unless n is very small compared to N. I'd have to do some math to determine that, though. – Stuart Marks Mar 09 '18 at 05:21
  • 1
    @FedericoPeraltaSchaffner I ran some simulations and the average tail length seems to be *N / (n+1)*. For the case described in the other question, where *N* = 200,000 and *n* = 5,000 that means the average tail is about 40. Since that's about 0.02% of the length of the input, the savings from short-circuiting are negligible. Of course if *n* is small the savings could be much larger. – Stuart Marks Mar 09 '18 at 06:57
  • @StuartMarks wow, thank you very much for these numbers. This is somehow a counterintuitive problem, so measuring becomes very important. – fps Mar 09 '18 at 09:55
4

You could always create a "dumb" comparator, that will compare elements randomly in the list. Calling distinct() will ensure you that the elements are unique (from the queue).

Something like this:

static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    final Random rand = new Random();
    return queue.stream()
                .distinct()
                .sorted(Comparator.comparingInt(a -> rand.nextInt()))
                .limit(n)
                .collect(Collectors.toList());
}

However I'm not sure it will be more efficient that putting the elements in the list, shuffling it and return a sublist.

static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    List<Integer> list = new ArrayList<>(queue);
    Collections.shuffle(list);
    return list.subList(0, n);
}

Oh, and it's probably semantically better to return a Set instead of a List since the elements are distincts. The methods are also designed to take Integers, but there's no difficulty to design them to be generic. :)

Just as a note, the Stream API looks like a tool box that we could use for everything, however that's not always the case. As you see, the second method is more readable (IMO), probably more efficient and doesn't have much more code (even less!).

Alexis C.
  • 91,686
  • 21
  • 171
  • 177
  • 1
    Interesting approach, and slightly hacky. However, by the looks of it it will run O(n*lg n) as ooposed to O(n) like my second attempt, due to sorting (regardless of correctness). Shuffle runs in linear time. – habitats Feb 21 '15 at 22:52
  • 1
    @habitats Yes I think also. But you asked how to do it with Streams :-) – Alexis C. Feb 21 '15 at 22:54
  • 1
    Your last edit is similar to my second attempt, with the exception of the "forEach" which was the whole idea behind this. I wanted to perform a "forEach" on a random sub-set of the elements in the given collection:). But yeah, I'm not sure which is more efficient, stream().limit(n) or sublist(0,n), but as I'll want a the "forEach" I lean towards the limit! – habitats Feb 21 '15 at 23:54
  • 1
    @habitats Yes; it was just to point out that it's probably more readable than the hacky stream approach (nothing fancy!). But what do you want to do with this `forEach`? Maybe there's another way to resolve this problem. – Alexis C. Feb 21 '15 at 23:55
  • 1
    I want to perform a method on n randomly selected objects, which will alter the objects internal state. – habitats Feb 21 '15 at 23:56
  • 1
    @habitats I see; so yes it reduces to how to pick n random unique elements from the Queue. So I don't think there's another way to do it (maybe let's see for other answers). – Alexis C. Feb 22 '15 at 00:00
  • @habitats If you have a list to begin with, `subList` is preferable to a stream with `skip()` or `limit()` as the `subList` approach leaves the stream with a definite size, whereas the latter two do not. – Stuart Marks Feb 22 '15 at 06:13
  • 1
    Trying to sort with such a comparator is likely to produce exceptions for violating the contract with the actual implementation (TimSort). Even if it doesn’t, you can’t preclude that other implementations will do. Therefore, a no-go. – Holger Feb 23 '15 at 10:00
  • @Holger is right, but his concern can be addressed by changing the comparator to `Comparator.comparingInt(a -> System.identityHashCode(a))` (or just `Ordering.arbitrary()` with Guava), which returns a consistent but still pseudo-random result. – MikeFHay Mar 15 '16 at 08:03
  • @Holger you're right, and I didn't delete my performance nonsense quickly enough. I got confused because although the `limit` currently does not affect the `sorted` behaviour, there is a proposal to change that: http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039308.html But until that's implemented, the copy-and-shuffle solution will be faster. – MikeFHay Mar 15 '16 at 08:35
  • @MikeFHay: even then, it won’t be faster than the approach of my answer which shuffles only as much as required. And the other point still holds, the identity hash code doesn’t change so the next time you try to get *n* random elements from the same set of objects with such a solution, you’ll get the same objects. And the identity hash code isn’t necessarily as random as you might wish. – Holger Mar 15 '16 at 08:42
2

As an addendum to the shuffle approach of the accepted answer:

If you want to select only a few items from a large list and want to avoid the overhead of shuffling the entire list you can solve the task as follows:

public static <T> List<T> getRandom(List<T> source, int num) {
    Random r=new Random();
    for(int i=0; i<num; i++)
        Collections.swap(source, i, i+r.nextInt(source.size()-i));
    return source.subList(0, num);
}

What it does is very similar to what shuffle does but it reduces it’s action to having only num random elements rather than source.size() random elements…

Holger
  • 285,553
  • 42
  • 434
  • 765
  • 2
    Yes, if the source list contained unique elements, the returned list will have unique elements as well. As said, it works like `shuffle`, just doing re-ordering; the only difference is that this solution doesn’t shuffle the *entire* list, so the excluded part might contain (possibly lots of) untouched elements still being at their original position. – Holger Feb 23 '15 at 11:56
  • Ah yeah, right, it just took some time to wrap my head around the snippet. Great idea! For my use case n actually varies a lot and is very small most of the time, while length of the collection is very huge, so this is actually perfect (with the exception of not being a stream:p)! – habitats Feb 23 '15 at 12:48
1

You can use limit to solve your problem.

http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#limit-long-

Collections.shuffle(collection); 

int howManyDoYouWant = 10;
List<Integer> smallerCollection = collection
    .stream()
    .limit(howManyDoYouWant)
    .collect(Collectors.toList());
iFytil
  • 459
  • 3
  • 7
0

It should be clear that streaming the collection is not what you want.

Use the generate() and limit methods:

Stream.generate(() -> list.get(new Random().nextInt(list.size())).limit(3).forEach(...);
Stuart Marks
  • 127,867
  • 37
  • 205
  • 259
Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • 3
    But you won't be sure to get distinct elements from the list with this. As the OP stated _"I'm attempting to retrieve n unique random elements "_ – Alexis C. Feb 21 '15 at 23:37
0
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int n = 4;
Random random = ThreadLocalRandom.current();

random.ints(0, collection.size())
        .distinct()
        .limit(n)
        .mapToObj(collection::get)
        .forEach(System.out::println);

This will of course have the overhead of the intermediate set of indexes and it will hang forever if n > collection.size().

If you want to avoid any non-constatn overhead, you'll have to make a stateful Predicate.

Misha
  • 27,433
  • 6
  • 62
  • 78
  • I realize I've confused a handful of people with my attempt at simplifying the problem by using integers and List as opposed to my actual implementation in question which uses arbitrary objects and a Queue. I will update the question once more. – habitats Feb 22 '15 at 01:04
0

If you want to process the whole Stream without too much hassle, you can simply create your own Collector using Collectors.collectingAndThen():

public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
    return Collectors.collectingAndThen(
      toList(),
      list -> {
          Collections.shuffle(list);
          return list.stream();
      });
}

But this won't perform well if you want to limit() the resulting Stream. In order to overcome this, one could create a custom Spliterator:

package com.pivovarit.stream;

import java.util.List;
import java.util.Random;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class ImprovedRandomSpliterator<T> implements Spliterator<T> {

    private final Random random;
    private final T[] source;
    private int size;

    ImprovedRandomSpliterator(List<T> source, Supplier<? extends Random> random) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("RandomSpliterator can't be initialized with an empty collection");
        }
        this.source = (T[]) source.toArray();
        this.random = random.get();
        this.size = this.source.length;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        int nextIdx = random.nextInt(size);
        int lastIdx = size - 1;

        action.accept(source[nextIdx]);
        source[nextIdx] = source[lastIdx];
        source[lastIdx] = null; // let object be GCed
        return --size > 0;
    }

    @Override
    public Spliterator<T> trySplit() {
        return null;
    }

    @Override
    public long estimateSize() {
        return source.length;
    }

    @Override
    public int characteristics() {
        return SIZED;
    }
}

and then:

public final class RandomCollectors {

    private RandomCollectors() {
    }

    public static <T> Collector<T, ?, Stream<T>> toImprovedLazyShuffledStream() {
        return Collectors.collectingAndThen(
          toCollection(ArrayList::new),
          list -> !list.isEmpty()
            ? StreamSupport.stream(new ImprovedRandomSpliterator<>(list, Random::new), false)
            : Stream.empty());
    }

    public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
        return Collectors.collectingAndThen(
          toCollection(ArrayList::new),
          list -> {
              Collections.shuffle(list);
              return list.stream();
          });
    }
}

And then you could use it like:

stream
  .collect(toLazyShuffledStream()) // or toEagerShuffledStream() depending on the use case
  .distinct()
  .limit(42)
  .forEach( ... );

A detailed explanation can be found here.

Grzegorz Piwowarek
  • 13,172
  • 8
  • 62
  • 93
0

If you want a random sample of elements from a stream, a lazy alternative to shuffling might be a filter based on the uniform distribution:

...
import org.apache.commons.lang3.RandomUtils

// If you don't know ntotal, just use a 0-1 ratio 
var relativeSize = nsample / ntotal;

Stream.of (...) // or any other stream
.parallel() // can work in parallel
.filter ( e -> Math.random() < relativeSize )
// or any other stream operation
.forEach ( e -> System.out.println ( "I've got: " + e ) ); 
zakmck
  • 2,715
  • 1
  • 37
  • 53