2

I would like to de-duplicate a list of Floats while specifying an appropriate epsilon for precision.

Here is the solution that I started writing before I realized that there are so many pitfalls. For example, this solution has problems with binning e.g. if I set epsilon to 1.0 and my input is {0.9, 1.0, 2.0, 3.0}, I get {0.9, 2.0}. However, if my input is {1.0, 2.0, 3.0}, I get {1.0, 3.0}.

Another issue is that it is unclear what is the best way to handle values like NaN, infinity, -0.0f, etc. so that this function works in many general use cases (perhaps there should be an optional parameter that customizes this behavior?).

I am sure there are other corner cases as well.

// Suffers from binning issues
public static List<Float> dedup(List<Float> floats, float epsilon) {
    List<Float> sortedFloats = new ArrayList<Float>();
    sortedFloats.addAll(floats);
    Collections.sort(sortedFloats);

    List<Float> dedupedList = new ArrayList<Float>();
    for (Float f : sortedFloats) {
        if (dedupedList.size() == 0) {
            dedupedList.add(f);
        } else {
            Float previousValue = dedupedList.get(dedupedList.size() - 1);
            if (Math.abs(previousValue - f) >= epsilon) {
                dedupedList.add(f);
            }
        }
    }
    return dedupedList;
}
Community
  • 1
  • 1
J Wang
  • 2,075
  • 1
  • 20
  • 26
  • Maybe [this](http://stackoverflow.com/questions/356807/java-double-comparison-epsilon) will help a bit – Idos Mar 02 '16 at 07:55
  • I suggest using BigDecimal instead of float. Also here's what you need to know about floating-point arithmetic: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Youssef Lahoud Mar 02 '16 at 07:56
  • 1
    @YoussefLahoud though that might not make any difference in this case. – Peter Lawrey Mar 02 '16 at 09:45
  • I suggest you add the Float values to a TreeSet which has a comparator which treats values inside the epsilon as equal, and will thus be dropped. This is an O(N log N) operation. – Peter Lawrey Mar 02 '16 at 09:46
  • When you have multiple values which are the same based on epsilon, you have to decide how you want to de-dup them i.e. which would be left. – Peter Lawrey Mar 02 '16 at 09:48
  • You could do a groupingBy function though the order you process the values would still matter (you could sort them first to ensure the result is stable) you could de-dup them at the end once you see all the possible values. e.g. if the epsilon is 1 you could pick the values which is the closest to a multiple of 1. – Peter Lawrey Mar 02 '16 at 09:50
  • 1
    @PeterLawrey Would such a Comparator meet the requirement of imposing a total order? You could have `a.compareTo(b)==0` and `b.compareTo(c)==0` without `a.compareTo(c)` being zero. – Patricia Shanahan Mar 02 '16 at 12:50
  • @PatriciaShanahan correct, however if you sort the values first, the outcome would at least be reproducable. A random ordering wouldn't. – Peter Lawrey Mar 02 '16 at 13:05

0 Answers0