4

I have a HashSet in which I have 10000 elements. I want to extract random 100 elements from that HashSet. So I thought I can use shuffle on the set but it doesn't work.

Set<String> users = new HashSet<String>();

// for randomness, but this doesn't work
Collections.shuffle(users, new Random(System.nanoTime()));  

// and use for loop to get 100 elements

I cannot use shuffle now, is there any other best way to get 100 random elements from HashSet in Java?

user1950349
  • 4,738
  • 19
  • 67
  • 119
  • 4
    Your code won’t compile as [`Collections.shuffle`](https://docs.oracle.com/javase/6/docs/api/java/util/Collections.html#shuffle%28java.util.List%29) requires a list. So try to create a `List` from your set and then shuffle this list. – Robin Krahl Apr 16 '15 at 19:37
  • Shuffle the result of `users.toArray()`. – pjs Apr 16 '15 at 19:43

3 Answers3

8

Without building a new list, you can implement the following algorithm:

n = 100
d = 10000  # length(users)
for user in users:
    generate a random number p between 0 and 1
    if p <= n / d:
       select user
       n -= 1
    d -= 1

As you iterate through the list, you decrease the probability of future elements from being chosen by decreasing n, but at the the same time increase the probability by decreasing d. Initially, you would have a 100/10000 chance of choosing the first element. If you decide to take that element, you would have a 99/9999 chance of choosing the second element; if you don't take the first one, you'll have a slightly better 100/9999 chance of picking the second element. The math works out so that in the end, every element has a 100/10000 chance of being selected for the output.

chepner
  • 497,756
  • 71
  • 530
  • 681
  • That seems to be a valid solution, although do you know where can I find mathematical proof? I saw similar answer http://stackoverflow.com/a/48089/926907 but from the comments for me it is not completely clear whether this is a correct approach or not. – Dmitry Zaytsev Apr 16 '15 at 20:42
  • Also, I guess it should be `p <= n / d`. For `n=1, d=1, p=1` it won't select the single element. – Dmitry Zaytsev Apr 16 '15 at 20:45
  • 1
    I can't seem to find a proof, but it's a fairly straightforward application of expected value. For the first element, it's obviously 100/1000. For the second algorithm, it's (100/10000)(99/9999) (probability that you choose the first element times the probability of choosing the second) plus (9900/10000)(100/9999) (probability that you *don't* choose the first times the probability of choosing the second), which should simplify to 100/10000. Similar (but increasingly more convoluted) math applies to the remaining elements. – chepner Apr 16 '15 at 22:43
  • 1
    The method is in Knuth Vol 2 2nd Ed, Algorithm 3.4.2 S, though unfortunately the proof is in the exercises (groan!) – rossum Apr 16 '15 at 23:13
6

Shuffling the collection implies that there is some defined order of elements within, so elements can be reordered. HashSet is not an ordered collection as there is no order of elements inside (or rather details of the ordering are not exposed to the user). Therefore implementation wise it's does not makes much sense to shuffle HashSet.

What you can do is add all elements from your set to the ArrayList, shuffle it and get your results.

List<String> usersList = new ArrayList<String>(users);
Collections.shuffle(usersList);
// get 100 elements out of the list
Dmitry Zaytsev
  • 23,650
  • 14
  • 92
  • 146
  • So you are saying I should make a List out of set and then do that? – user1950349 Apr 16 '15 at 19:47
  • @user1950349 answer given by chepner seems to produce correct results. If amount of users is going to be small I would suggest to convert set to `List`. Otherwise, consider chepner's solution. – Dmitry Zaytsev Apr 16 '15 at 20:44
-1

The java.lang.HashSet has an order so you can't shuffle Sets. If you must use Sets you might iterate over the Set and stop on a random position.

Pseudocode:

Set randomUsers = new HashSet<String>();
Random r = new Random();
Iterator it = users.iterator(); 
numUsersNeeded = 100;
numUsersLeft = users.size();
while (it.hasNext() && randomUsers.size() < 100) {
  String user = it.next();
  double prop = (double)numUsersNeeded / numUsersLeft;
  --numUsersLeft;
  if (prop > r.nextDouble() && randomUsers.add(user)) { 
    --numUsersNeeded;
  }
}

You might repeat this because there is no garantiy that you fetch 100 elements.

If memory is no issue you can create an array and pick 100 random elements:

Pseudocode II:

Object userArray[] = user.toArray();
Set<String> randoms = new HashSet<String>();
while(randoms.size() != 100) {
  int randomUser = userArray[new Random().nexInt(10000)];
  randoms.add(randomUser);
}
TobiSH
  • 2,833
  • 3
  • 23
  • 33
  • 1
    That won't be a uniform distribution then - last elements have less probability of being selected – Dmitry Zaytsev Apr 16 '15 at 20:41
  • You're right. Thanks for pointing that out. I adjusted the first code to give the first elements a lower properbility to be selected. As far as I can see we should now have a uniform distribution. – TobiSH Apr 17 '15 at 05:13
  • Your second Pseudocode is also not correct - it might select same element from array several times in a row. That will result in less than 100 elements in output set. – Dmitry Zaytsev Apr 17 '15 at 07:31