Getting the nth element of a set several times in Java

Question

Is there any efficient way for getting the nth element of a set in Java? I know 2 ways of doing it: - By iterating through it until I reach the required element - By converting it to an ArrayList and getting the elements from that ArrayList The question is that is there any other way to get the nth element of it quickly. I would mainly need a feature like that for the TreeSets.

EDIT: For example if I want to select 1000 random elements from a 10 000 000 element long treemap or treeset, very frequently (i.e. every 2-3 seconds), then cloning it to an arraylist all the time is very inefficient, and iterating through so many elements is also inefficient.

Unless the set is a sorted set (e.g. a `TreeSet`) there is no notion of "nth". `TreeSet` on the other hand doesn't provide index access since that would need iteration anyways (much like accessing the nth element in a linked list would require iteration) and thus I'd just keep using the iteration method or keep track of which element is the nth during set modification operations. — Thomas, Jan 12 '17 at 14:26
Is there any way to get it by using some kind of special keys, which has a custom compareTo, equals and hashCode method? — gyurix, Jan 12 '17 at 14:34
Even if that would be possible (and almost everything is possible somehow) you wouldn't really gain anything: you'd have to know the exact layout of the tree, which index would correspond to the root node, whether the tree is reversed or not etc. and you'd have to track the lookup steps and calculate the current index from that - in the end you'd probably do as much if not more as you'd do when iterating. On the other hand using special keys like this would be quite error prone and actually break your tree - so I'd not do it. — Thomas, Jan 12 '17 at 14:41
I would like to see a solution for it first and then we could think about what will it gain :) By the way for example if I want to select 1000 random elements from a 10 000 000 element long treeset, very frequently, then cloning it to an arraylist all the time is very inefficient, and iterating through so many elements is also inefficient. — gyurix, Jan 12 '17 at 14:48
"I would like to see a solution for it first" - I'd rather not invest into a solution that's bound to do more harm than solve anything :) I'll add an answer as to what challenges you'd face (at least some of them) but the best option would probably to take a step back and think about whether a set is the right structure here. Can you edit your question and add some more requirements like what you need that for, why you need to use a set (or why you think you do) etc.? — Thomas, Jan 12 '17 at 15:08
Does the set change its structure over time or are its elements 'fixed'? (I mean if you add or remove elements over time) — fps, Jan 12 '17 at 15:09
Yes of course I am planning to add/remove elements all the time. If I would have a fixed amount of data, then I could just use arrays and not sets :D — gyurix, Jan 12 '17 at 19:41
If the real problem is to select *n* elements at random from a collection containing *N* elements, then perhaps this will help: http://stackoverflow.com/a/28655112/1441122 — Stuart Marks, Jan 12 '17 at 19:57
Yep, Stuart Marks, exactly, that's the most inefficient way for it. That solution is good for small amount of data, like selecting 20 random elements from 150 elements, but definitely not for large amount of data. — gyurix, Jan 12 '17 at 22:48

score 1 · Answer 1 · answered Jan 12 '17 at 15:24

If your requirement is to select random elements out of a rather huge set then you should ask yourself whether a set is the best fit for that.

If you want to use the built-in sets there are several challenges you'd face.

TreeSet

A TreeSet is an ordered set and thus would allow you to access the n-th element. However, you'd have to iterate to position n since there is no array that allows random access like an ArrayList would. As the name implies the nodes in a TreeSet form a tree and the nodes most likely are stored anywhere in memory. Because of this to get the n-th element you'd have to start at the first node and hop from node to node until you reach position n - which is similar to how you'd do it in a LinkedList.

If all you want to do is select a random element there are several options:

If the set doesn't change (or not often) you could create a matching array or ArrayList and use random access.
Iterate over the set for a random number of times.
Generate a random key and look up the next higher/lower element, e.g. by using tailSet(randomKey) and getting the first element of that tail set. Of course you'd have to handle random keys that are outside the elements' range. That way a lookup would basically be a binary search.

HashSet

HashSets basically consist of 2 things: an array of buckets and a linked list or tree for collisions, i.e. if 2 elements would be mapped to the same bucket. Getting a random element might then be done by accessing a random bucket (this would be random access) and then iterating over the elements in that bucket for a random number of times.

Is there any way to access these buckets from the default java HashSets or I need to make a custom HashSet implementation for it? — gyurix, Jan 12 '17 at 19:39
@gyurix you can have a look at the sources but I'd assume you can't access them from the outside so you'd at least have to create a subclass of `HashSet` (unless the buckets are private in which case you'd need another implementation). — Thomas, Jan 13 '17 at 08:24

leeyuiwah · Accepted Answer · 2017-01-14T17:39:44.687

If you are sure that you need n elements from random positions in the set (kind of like a statistical sampling), then you may want to consider just iterating through the set once and pick up the samples, by the desired probability, as you iterate through the set. This way is more efficient as you will iterate through the set only once.

The following program demonstrates the idea:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class SamplingFromSet {

    public static void main(String[] args) {
        Set<String> population = new TreeSet<>();

        /*
         * Populate the set
         */
        final int popSize = 17;
        for (int i=0; i<popSize; i++) {
            population.add(getRandomString());
        }

        List<String> sample 
            = sampleFromPopulation(population, 3 /*sampleSize */);

        System.out.println("population is");
        System.out.println(population.toString());
        System.out.println("sample is");
        System.out.println(sample.toString());

    }


    /**
     * Pick some samples for a population
     * @param population
     * @param sampleSize - number of samples
     * @return
     */
    private static <T> 
    List<T> sampleFromPopulation(Set<T> population
                                    , int sampleSize) {
        float sampleProb = ((float) sampleSize) / population.size();
        List<T> sample = new ArrayList<>();
        Iterator<T> iter = population.iterator();
        while (iter.hasNext()) {
            T element = iter.next();
            if (random.nextFloat()<sampleProb) {
                /*
                 * Lucky Draw!
                 */
                sample.add(element);
            }
        }
        return sample;
    }


    private static Random random = new Random();    

    private static String getRandomString() {
        return String.valueOf(random.nextInt());
    }
}

Output of this program:

population is
[-1488564139, -1510380623, -1980218182, -354029751, -564386445, -57285541, -753388655, -775519772, 1538266464, 2006248253, 287039585, 386398836, 435619764, 48109172, 580324150, 64275438, 860615531]
sample is
[-57285541, -753388655, 386398836]

Update

The above program, however, has a caveat -- since the picking up of samples in that one walk through the set is done by probability, the returned sample may, depending on your luck of the day, have fewer or more samples than specified. This problem, however, can remedied with a slight change of procedure, which uses a slightly different method signature:

/**
 * Pick some samples from a population
 * @param population
 * @param sampleSize - number of samples
 * @param exactSize - a boolean to control whether or not
 *   the returned sample list must be of the exact size as
 *   specified.
 * @return
 */
private static <T> 
List<T> sampleFromPopulation(Set<T> population
                                , int sampleSize
                                , boolean exactSize);

Prevention of oversampling

In the one iteration through the population, we over sample a bit, and then at the end we drop some samples if we do have too many.

Prevention of undersampling

Note also that, even with oversampling, there is a non-zero probability that, at the end of the one iteration through the population, we still get less samples than desired. If that happen (unlikely), we will recursively calling the same method again as a second try. (This recursion has a probability approaching one to terminate because it is very unlike that, in repeated recursive call into the method, we consistently get undersampling.)

The following code implements the new sampleFromPopulation() method:

private static <T> 
List<T> sampleFromPopulation(Set<T> population
                                , int sampleSize
                                , boolean exactSize) {
    int popSize = population.size();
    double sampleProb = ((double) sampleSize) / popSize;

    final double    OVER_SAMPLING_MULIT = 1.2;
    if (exactSize) {
        /*
         * Oversampling to enhance of chance of getting enough
         * samples (if we then have too many, we will drop them 
         * later)
         */
        sampleProb = sampleProb * OVER_SAMPLING_MULIT;
    }
    List<T> sample = new LinkedList<>(); // linked list for fast removal
    Iterator<T> iter = population.iterator();
    while (iter.hasNext()) {
        T element = iter.next();
        if (random.nextFloat()<sampleProb) {
            /*
             * Lucky Draw!
             */
            sample.add(element);
        }
    }
    int samplesTooMany = sample.size() - sampleSize;
    if (!exactSize || samplesTooMany==0) {
        return sample;
    } else if (samplesTooMany>0) {
        Set<Integer> indexesToRemoveAsSet = new HashSet<>();
        for (int i=0; i<samplesTooMany; ) {
            int candidate = random.nextInt(sample.size());
            if (indexesToRemoveAsSet.add(candidate)) {
                /*
                 * add() returns true if candidate was not 
                 * previously in the set
                 */
                i++; // proceed to draw next index
            }
        }
        List<Integer> indexesToRemoveAsList 
            = new ArrayList<>(indexesToRemoveAsSet);
        Collections.sort(indexesToRemoveAsList
                , (i1, i2) -> i2.intValue() - i1.intValue()); // desc order  
        /*
         * Now we drop from the tail of the list
         */
        for (Integer index : indexesToRemoveAsList) {
            sample.remove((int) index); // remove by index (not by element)
        }
        return sample;
    } else { 
        /*
         * we were unluckly that we oversampling we still
         * get less samples than specified, so here we call
         * this very same method again recursively
         */
        return sampleFromPopulation(population, sampleSize, exactSize);
    }
}

That's a pretty good solution ;) Just the sample size might not be the wanted one but be a few more or a few less. Can you improve your code to fix that issue efficiently? — gyurix, Jan 14 '17 at 00:38
Can you elaborate what do you mean by "a few more or a few less"? — leeyuiwah, Jan 14 '17 at 01:32
@gyurix -- Okay I see what you mean. I have updated my answer above to address the issue. — leeyuiwah, Jan 14 '17 at 17:35

Getting the nth element of a set several times in Java

2 Answers2