Distinct selection, even with a custom EqualityComparer, still leaves duplicate entries

Question

I am fairly lost with this script - I don't get it - why does it leave duplicate entries?

private static float GenerateMedian(IEnumerable<Collider> items, KDAxis axis)
{
    float[] allValues = items.SelectMany(AxisSelector(axis)).ToArray();
    Debug.LogFormat("{0} all values for {1} items: {2}.", allValues.Length, items.Count(), string.Join(", ", allValues.Select(v => v.ToString("F10")).ToArray()));
    #if BASIC_DISTINCT
    float[] values = allValues.Distinct().OrderBy(f => f).ToArray();
    #else
    float[] values = allValues.Distinct(new KDFloatComparer(0.0001f)).OrderBy(f => f).ToArray();
    #endif
    Debug.LogFormat("{0} distinct values for {1} items: {2}.", values.Length, items.Count(), string.Join(", ", values.Select(v => v.ToString("F10")).ToArray()));

    int medianIndex = Mathf.CeilToInt(values.Length / 2f) - 1;
    float medianValue = values[medianIndex];

    Debug.LogFormat("Median index: {0} (left: {1}; right: {2}) value: {3}", medianIndex, medianIndex + 1, values.Length - 1 - medianIndex, medianValue);

    return medianValue;
}

private static Func<Collider, IEnumerable<float>> AxisSelector(KDAxis axis)
{
    switch (axis)
    {
        case KDAxis.X:
            return XAxisSelector;

        case KDAxis.Y:
            return YAxisSelector;

        case KDAxis.Z:
            return ZAxisSelector;
    }

    return XAxisSelector;
}

private static IEnumerable<float> XAxisSelector(Collider collider)
{
    yield return collider.bounds.max.x;
    yield return collider.bounds.min.x;
}

private static IEnumerable<float> YAxisSelector(Collider collider)
{
    yield return collider.bounds.max.y;
    yield return collider.bounds.min.y;
}

private static IEnumerable<float> ZAxisSelector(Collider collider)
{
    yield return collider.bounds.max.z;
    yield return collider.bounds.min.z;
}

Provides this output:

28 all values for 14 items: 3.0000000000, 2.0000000000, 11.0000000000, -11.0000000000, -5.0000010000, -10.0000000000, 3.0000000000, 2.0000000000, 3.0000000000, 2.0000000000, 11.0000000000, -11.0000000000, -10.0000000000, -11.0000400000, 3.0000000000, 2.0000000000, 7.0000000000, 6.0000000000, -7.0000000000, -10.0000000000, 10.0000000000, -10.0000000000, 11.0000000000, 9.9999550000, -8.0000000000, -9.9999980000, 3.0000000000, 2.0000000000.
20 distinct values for 14 items: -11.0000400000, -11.0000000000, -10.0000000000, -10.0000000000, -9.9999980000, -8.0000000000, -7.0000000000, -5.0000010000, 2.0000000000, 2.0000000000, 2.0000000000, 3.0000000000, 3.0000000000, 3.0000000000, 6.0000000000, 7.0000000000, 9.9999550000, 10.0000000000, 11.0000000000, 11.0000000000.

And it clearly contains duplicates - for instance the 3 x 2.0 and 3 x 3.0.

Even if I were to implement a custom float comparer, and feed it into Distinct() with new KDFloatComparer(0.0001f):

public class KDFloatComparer : EqualityComparer<float>
{
    public readonly float InternalEpsilon = 0.001f;

    public KDFloatComparer(float epsilon) : base()
    {
        InternalEpsilon = epsilon;
    }

    // http://stackoverflow.com/a/31587700/393406
    public override bool Equals(float a, float b)
    {
        float absoluteA = Math.Abs(a);
        float absoluteB = Math.Abs(b);
        float absoluteDifference = Math.Abs(a - b);

        if (a == b) 
        {
            return true;
        } 
        else if (a == 0 || b == 0 || absoluteDifference < float.Epsilon) 
        {
            // a or b is zero or both are extremely close to it.
            // Relative error is less meaningful here.
            return absoluteDifference < InternalEpsilon;
        } 
        else 
        { 
            // Use relative error.
            return absoluteDifference / (absoluteA + absoluteB) < InternalEpsilon;
        }

        return true;
    }

    public override int GetHashCode(float value)
    {
        return value.GetHashCode();
    }
}

The result is exactly the same.

I did try to replicate the scenario over on csharppad.com - it didn't leave duplicates. Though, I didn't use the SelectMany approach, I made raw arrays with the reported ToString("F10") values, which makes me think that the problem is with floating point precision, however, no matter how I have implemented the EqualityComparer (had some custom variations before attempting to use the SO one), I cannot seem to nail it.

How could I fix this?

The cause is the inexactitude at floats, but your custom comparer should avoid that problem, can you post the code where you use the custom comparer? — Gusman, Feb 04 '16 at 12:51
@Gusman, for the purposes of saving space, I edited the `GenerateMedian` for this question to include a compiler definition check that would determine which to use. I'm just chaining it like `Distinct(new KDFloatComparer(0.0001f))` instead of plain `Distinct()`. — tomsseisums, Feb 04 '16 at 12:58

usr · Accepted Answer · 2016-02-04T13:15:20.853

2

Your Equals is broken because it does not satisfy triangle inequality. It must be that a == b && b == c ==> a == c. This is not the case thanks to the epsilon comparison.

Really, this does not make sense. If you have numbers new [] { 0, epsilon, epsilon * 2 }, which ones of these three numbers do you want to keep?! You need to define this better and use a different algorithm.

When you violate the contracts of Equals and GetHashCode you get undefined behavior.

Another problem is that some values with unequal hash code will compare equal here.

I did try to replicate the scenario over on csharppad.com - it didn't leave duplicates

Undefined behavior sometimes means getting correct results.

edited Feb 04 '16 at 13:15

answered Feb 04 '16 at 13:02

usr

168,620
35
240
369

_UB_? And are there some advanced guides/theories around on how to implement such a comparison? – tomsseisums Feb 04 '16 at 13:12
1

https://msdn.microsoft.com/en-us/library/bsc2ak47(v=vs.110).aspx `The following statements must be true for all implementations of the Equals(Object) method`. There's surely a list for hash codes, too. In this case I'd consider sorting the list, traversing it front to back and applying the equals logic to adjacent elements. – usr Feb 04 '16 at 13:16

score 1 · Answer 2 · edited Feb 04 '16 at 14:39

I have created an small console project to test it:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace TestEqual
{
    class Program
    {

        static float[] values = new float[] { 3.0000000000f, 2.0000000000f, 11.0000000000f, -11.0000000000f, -5.0000010000f, -10.0000000000f, 3.0000000000f, 2.0000000000f, 3.0000000000f, 2.0000000000f, 11.0000000000f, -11.0000000000f, -10.0000000000f, -11.0000400000f, 3.0000000000f, 2.0000000000f, 7.0000000000f, 6.0000000000f, -7.0000000000f, -10.0000000000f, 10.0000000000f, -10.0000000000f, 11.0000000000f, 9.9999550000f, -8.0000000000f, -9.9999980000f, 3.0000000000f, 2.0000000000f };

        static void Main(string[] args)
        {
            var distinct = values.Distinct(new KDFloatComparer(0.001f)).OrderBy(d => d).ToArray();

            Console.WriteLine("Valores distintos: ");

            foreach (var f in distinct)
                Console.WriteLine(f);

            Console.ReadKey();
        }

        public class KDFloatComparer : EqualityComparer<float>
        {
            public readonly float InternalEpsilon = 0.001f;

            public KDFloatComparer(float epsilon)
                : base()
            {
                InternalEpsilon = epsilon;
            }

            // http://stackoverflow.com/a/31587700/393406
            public override bool Equals(float a, float b)
            {
                float absoluteA = Math.Abs(a);
                float absoluteB = Math.Abs(b);
                float absoluteDifference = Math.Abs(a - b);

                if (a == b)
                {
                    return true;
                }
                else if (a == 0 || b == 0 || absoluteDifference < InternalEpsilon)
                {
                    // a or b is zero or both are extremely close to it.
                    // Relative error is less meaningful here.
                    return absoluteDifference < InternalEpsilon;
                }
                else
                {
                    // Use relative error.
                    return absoluteDifference / (absoluteA + absoluteB) < InternalEpsilon;
                }

                return true;
            }

            public override int GetHashCode(float value)
            {
                return value.GetHashCode();
            }
        }

        public class FComparer : IEqualityComparer<float>
        {

            public bool Equals(float x, float y)
            {

                var dif = Math.Abs(x - y);

                if ((x == 0 || y == 0) && dif < float.Epsilon)
                    return true;

                if (Math.Sign(x) != Math.Sign(y))
                    return false;

                return dif < float.Epsilon;
            }

            public int GetHashCode(float obj)
            {
                return obj.GetHashCode();
            }
        }

    }
}

The result under Linux/Mono V4.0.1 where these:

Valores distintos: -11,00004 -11 -10 -9,999998 -8 -7 -5,000001 2 3 6 7 9,999955 10 11

So the only thing I can think is that your mono version has float math errors, indeed there were some old versions which did had some problems with it.

Try to update your mono version to the latest one, even better, compile it from the latest source on your machine.

Also, I have included an smaller comparer which leads to the same results.

EDIT: I also have corrected your comparer, in one place you were using InternalEpsilon and in other float.Epsilon, float.Epsilon is 1,401298E-45, which is not representable in your strings as they only have nine decimal places, if there were a discrepance inferior to 0.000000001 you did not see it as it was cropped.

EDIT: It seems Distinct is only executing Equals of the comparer only if the hash code is the same, so as every float has a different hash code Equals is never being executed.

This example is working 100% with random numbers generated.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace TestEqual
{
    class Program
    {
        static void Main(string[] args)
        {

            Random rnd = new Random();

            List<float> numbers = new List<float>();

            for(int buc = 0; buc < 1000; buc++)
                numbers.Add((float)rnd.NextDouble());

            var distinct = numbers.OrderBy(d => d).Distinct(new FComparer()).OrderBy(d => d).ToArray();

            Console.WriteLine(float.Epsilon);

            Console.WriteLine("Valores distintos: ");

            foreach (var f in distinct)
                Console.WriteLine(f);

            foreach (var f in distinct)
            {

                for (int buc = 0; buc < distinct.Length; buc++)
                    if (Math.Abs(f - distinct[buc]) < 0.001f && f != distinct[buc])
                        Console.WriteLine("Duplicate");

            }

            Console.ReadKey();
        }

        public class FComparer : IEqualityComparer<float>
        {

            public bool Equals(float x, float y)
            {

                var dif = Math.Abs(x - y);

                if ((x == 0 || y == 0) && dif < 0.001f)
                    return true;

                if (Math.Sign(x) != Math.Sign(y))
                    return false;

                return dif < 0.001f;
            }

            public int GetHashCode(float obj)
            {
                //This is the key, if GetHashCode is different then Equals is not called
                return 0;
            }
        }

    }
}

I changed the code to use these explicit values instead of accumulated ones and it did return correct results. I am starting to think that the comparison with explicitly defined values is flawed. When the values come from the underlying 3D engine or wherever they come from in Unity3D, they probably have some minor discrepancies, even though they are _equal_. Those discrepancies invalidate the `a == b` check, which in explicit case will always be truthy. — tomsseisums, Feb 04 '16 at 13:27
Then the problem is that you're using in one place float.Epsilon and in other InternalEpsilon, change the code to use both InternalEpsilon and it will yield the correct results for values with a difference inferior to 0.001f — Gusman, Feb 04 '16 at 13:31
As I already said - the problem doesn't exist for explicit values (they are flawed!). The actual values have the problem. And while the actual values are _equal_, they most probably pose some minor discrepancies which is where the comparer fails. The line you corrected has no problems - it simply checks if the the two values are unequal by the smallest possible floating-point margin, and then aligns the result to the user specified permitted difference - the `InternalEpsilon`. — tomsseisums, Feb 04 '16 at 13:41
You gave an InternalEpsilon of 0.001 but that line would make to reject the numbers as different if it's difference is lesser to 1,401298E-45. Also, as I said, your output to console is cropping the result to 9 decimals, you aren't really sure if they're different. Try to use my comparer with 0.001f instead of float.Epsilon and you will get the results you spect. — Gusman, Feb 04 '16 at 13:46
Might be, might be, excuse me, am not paying too much attention to this explicit test case as it doesn't have the problem. Am busy finding out the internal workings of `Distinct()` to understand what the f* is going on. — tomsseisums, Feb 04 '16 at 14:02
Your problem does not have nothing to do with Distinct, the problem is in the comparer, try the FComparer on my example replacing float.Epsilon with 0.001f and your problem should be corrected. — Gusman, Feb 04 '16 at 14:04
Then you have some error in other place, there's no difference between generated numbers and the one's used on initialization, I'm going to update my answer with an RNG generated values to prove it (or may be find there is a real error on mono, but I don't belive). — Gusman, Feb 04 '16 at 14:15
Found the problem, DIstinct only calls Equals if the hash code is the same, and each float has a different hash code. Updated the answer with the correct solution. — Gusman, Feb 04 '16 at 14:29

Distinct selection, even with a custom EqualityComparer, still leaves duplicate entries

2 Answers2