How do I properly de-duplicate a list of floats?

Question

I would like to de-duplicate a list of Floats while specifying an appropriate epsilon for precision.

Here is the solution that I started writing before I realized that there are so many pitfalls. For example, this solution has problems with binning e.g. if I set epsilon to 1.0 and my input is {0.9, 1.0, 2.0, 3.0}, I get {0.9, 2.0}. However, if my input is {1.0, 2.0, 3.0}, I get {1.0, 3.0}.

Another issue is that it is unclear what is the best way to handle values like NaN, infinity, -0.0f, etc. so that this function works in many general use cases (perhaps there should be an optional parameter that customizes this behavior?).

I am sure there are other corner cases as well.

// Suffers from binning issues
public static List<Float> dedup(List<Float> floats, float epsilon) {
    List<Float> sortedFloats = new ArrayList<Float>();
    sortedFloats.addAll(floats);
    Collections.sort(sortedFloats);

    List<Float> dedupedList = new ArrayList<Float>();
    for (Float f : sortedFloats) {
        if (dedupedList.size() == 0) {
            dedupedList.add(f);
        } else {
            Float previousValue = dedupedList.get(dedupedList.size() - 1);
            if (Math.abs(previousValue - f) >= epsilon) {
                dedupedList.add(f);
            }
        }
    }
    return dedupedList;
}

Maybe [this](http://stackoverflow.com/questions/356807/java-double-comparison-epsilon) will help a bit — Idos, Mar 02 '16 at 07:55
I suggest using BigDecimal instead of float. Also here's what you need to know about floating-point arithmetic: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html — Youssef Lahoud, Mar 02 '16 at 07:56
@YoussefLahoud though that might not make any difference in this case. — Peter Lawrey, Mar 02 '16 at 09:45
I suggest you add the Float values to a TreeSet which has a comparator which treats values inside the epsilon as equal, and will thus be dropped. This is an O(N log N) operation. — Peter Lawrey, Mar 02 '16 at 09:46
When you have multiple values which are the same based on epsilon, you have to decide how you want to de-dup them i.e. which would be left. — Peter Lawrey, Mar 02 '16 at 09:48
You could do a groupingBy function though the order you process the values would still matter (you could sort them first to ensure the result is stable) you could de-dup them at the end once you see all the possible values. e.g. if the epsilon is 1 you could pick the values which is the closest to a multiple of 1. — Peter Lawrey, Mar 02 '16 at 09:50
@PeterLawrey Would such a Comparator meet the requirement of imposing a total order? You could have `a.compareTo(b)==0` and `b.compareTo(c)==0` without `a.compareTo(c)` being zero. — Patricia Shanahan, Mar 02 '16 at 12:50
@PatriciaShanahan correct, however if you sort the values first, the outcome would at least be reproducable. A random ordering wouldn't. — Peter Lawrey, Mar 02 '16 at 13:05

How do I properly de-duplicate a list of floats?

0 Answers0