If you are sure that you need n elements from random positions in the set (kind of like a statistical sampling), then you may want to consider just iterating through the set once and pick up the samples, by the desired probability, as you iterate through the set. This way is more efficient as you will iterate through the set only once.
The following program demonstrates the idea:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;
public class SamplingFromSet {
public static void main(String[] args) {
Set<String> population = new TreeSet<>();
/*
* Populate the set
*/
final int popSize = 17;
for (int i=0; i<popSize; i++) {
population.add(getRandomString());
}
List<String> sample
= sampleFromPopulation(population, 3 /*sampleSize */);
System.out.println("population is");
System.out.println(population.toString());
System.out.println("sample is");
System.out.println(sample.toString());
}
/**
* Pick some samples for a population
* @param population
* @param sampleSize - number of samples
* @return
*/
private static <T>
List<T> sampleFromPopulation(Set<T> population
, int sampleSize) {
float sampleProb = ((float) sampleSize) / population.size();
List<T> sample = new ArrayList<>();
Iterator<T> iter = population.iterator();
while (iter.hasNext()) {
T element = iter.next();
if (random.nextFloat()<sampleProb) {
/*
* Lucky Draw!
*/
sample.add(element);
}
}
return sample;
}
private static Random random = new Random();
private static String getRandomString() {
return String.valueOf(random.nextInt());
}
}
Output of this program:
population is
[-1488564139, -1510380623, -1980218182, -354029751, -564386445, -57285541, -753388655, -775519772, 1538266464, 2006248253, 287039585, 386398836, 435619764, 48109172, 580324150, 64275438, 860615531]
sample is
[-57285541, -753388655, 386398836]
Update
The above program, however, has a caveat -- since the picking up
of samples in that one walk through the set is done by probability,
the returned sample
may, depending on your luck of the day,
have fewer or more samples than specified.
This problem, however, can remedied with a slight change of procedure,
which uses a slightly different method signature:
/**
* Pick some samples from a population
* @param population
* @param sampleSize - number of samples
* @param exactSize - a boolean to control whether or not
* the returned sample list must be of the exact size as
* specified.
* @return
*/
private static <T>
List<T> sampleFromPopulation(Set<T> population
, int sampleSize
, boolean exactSize);
Prevention of oversampling
In the one iteration through the population, we over sample a bit,
and then at the end we drop some samples if we do have too many.
Prevention of undersampling
Note also that, even with oversampling, there is a non-zero probability
that, at the end of the one iteration through the population, we still
get less samples than desired. If that happen (unlikely), we will recursively calling
the same method again as a second try. (This recursion has a probability approaching one
to terminate because it is very unlike that, in repeated recursive
call into the method, we consistently get undersampling.)
The following code implements the new sampleFromPopulation()
method:
private static <T>
List<T> sampleFromPopulation(Set<T> population
, int sampleSize
, boolean exactSize) {
int popSize = population.size();
double sampleProb = ((double) sampleSize) / popSize;
final double OVER_SAMPLING_MULIT = 1.2;
if (exactSize) {
/*
* Oversampling to enhance of chance of getting enough
* samples (if we then have too many, we will drop them
* later)
*/
sampleProb = sampleProb * OVER_SAMPLING_MULIT;
}
List<T> sample = new LinkedList<>(); // linked list for fast removal
Iterator<T> iter = population.iterator();
while (iter.hasNext()) {
T element = iter.next();
if (random.nextFloat()<sampleProb) {
/*
* Lucky Draw!
*/
sample.add(element);
}
}
int samplesTooMany = sample.size() - sampleSize;
if (!exactSize || samplesTooMany==0) {
return sample;
} else if (samplesTooMany>0) {
Set<Integer> indexesToRemoveAsSet = new HashSet<>();
for (int i=0; i<samplesTooMany; ) {
int candidate = random.nextInt(sample.size());
if (indexesToRemoveAsSet.add(candidate)) {
/*
* add() returns true if candidate was not
* previously in the set
*/
i++; // proceed to draw next index
}
}
List<Integer> indexesToRemoveAsList
= new ArrayList<>(indexesToRemoveAsSet);
Collections.sort(indexesToRemoveAsList
, (i1, i2) -> i2.intValue() - i1.intValue()); // desc order
/*
* Now we drop from the tail of the list
*/
for (Integer index : indexesToRemoveAsList) {
sample.remove((int) index); // remove by index (not by element)
}
return sample;
} else {
/*
* we were unluckly that we oversampling we still
* get less samples than specified, so here we call
* this very same method again recursively
*/
return sampleFromPopulation(population, sampleSize, exactSize);
}
}