I am having a problem with a custom struct and LINQ's Except method when trying to remove duplicates.

My struct is as follows:

public struct hashedFile
{
    string _fileString;
    byte[] _fileHash;

    public hashedFile(string fileString, byte[] fileHash)
    {
        this._fileString = fileString;
        this._fileHash = fileHash;
    }

    public string FileString { get { return _fileString; } }
    public byte[] FileHash { get { return _fileHash; } }
}

Now, the following code works fine:

    public static void test2()
    {
        List<hashedFile> list1 = new List<hashedFile>();
        List<hashedFile> list2 = new List<hashedFile>();

        hashedFile one = new hashedFile("test1", BitConverter.GetBytes(1));
        hashedFile two = new hashedFile("test2", BitConverter.GetBytes(2));
        hashedFile three = new hashedFile("test3", BitConverter.GetBytes(3));
        hashedFile threeA = new hashedFile("test3", BitConverter.GetBytes(4));
        hashedFile four = new hashedFile("test4", BitConverter.GetBytes(4));

        list1.Add(one); 
        list1.Add(two);
        list1.Add(threeA);
        list1.Add(four);

        list2.Add(one);
        list2.Add(two);
        list2.Add(three);

        List<hashedFile> diff = list1.Except(list2).ToList();

        foreach (hashedFile h in diff)
        {
            MessageBox.Show(h.FileString + Environment.NewLine + h.FileHash[0].ToString("x2"));
        }

    }

This code shows "threeA" and "four" just fine. But if I do the following:

public static List<hashedFile> list1(var stuff1)
{
//Generate a List here and return it
}

public static List<hashedFile> list2(var stuff2)
{
//Generate a List here and return it
}

List<hashedFile> diff = list1(stuff1).Except(list2(stuff2)).ToList();

"diff" becomes an exact copy of "list1". I should also mention that, when generating the lists, I pass the byte array returned by ComputeHash from System.Security.Cryptography.MD5 as the byte[] fileHash.

Any ideas on how to customize either Except or GetHashCode so that LINQ successfully excludes the duplicate values found in list2?

I'd really appreciate it! Thanks! ~MrFreeman

EDIT: Here is how I was originally trying to use it: List<hashedFile> diff = newList.Except(oldList, new hashedFileComparer()).ToList();

class hashedFileComparer : IEqualityComparer<hashedFile>
{

    public bool Equals(hashedFile x, hashedFile y)
    {
        if (Object.ReferenceEquals(x, y)) return true;

        if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
            return false;

        return x.FileString == y.FileString && x.FileHash == y.FileHash;
    }

    public int GetHashCode(hashedFile Hashedfile)
    {
        if (Object.ReferenceEquals(Hashedfile, null)) return 0;

        int hashFileString = Hashedfile.FileString == null ? 0 : Hashedfile.FileString.GetHashCode();
        int hashFileHash = Hashedfile.FileHash.GetHashCode();
        int returnVal = hashFileString ^ hashFileHash;
        if (Hashedfile.FileString.Contains("blankmusic") == true)
        {
            Console.WriteLine(returnVal.ToString());
        }

        return returnVal;
    }

}
MrFreeman
  • If the code above for the `hashedFile` type is all you have, then the behavior is expected, since there is no `Equals`/`GetHashCode` (arrays and other .NET collections are not compared by value, so you need to write it yourself). Side note: using a `struct` is often a personal choice when you want some pain; make sure you understand what you are doing. – Alexei Levenkov Oct 07 '13 at 02:24
  • I overloaded my own IEqualityComparer, but it yields the exact same results. In fact, stepping through the code, it never even touches the overloaded "Equals" when using the Except method. – MrFreeman Oct 07 '13 at 02:27
  • Why is `HashedFile` a struct? – Gert Arnold Oct 07 '13 at 08:03
  • I don't see your `GetHashCode`, so it is hard to say why the hash codes fail to match (and hence `Equals` is never called). – Alexei Levenkov Oct 07 '13 at 16:29
  • I used a struct instead of a class because I noticed some performance issues when using a similar class in my program that I changed to a struct. My original GetHashCode was in a custom EqualityComparer class which I passed into linq's Except, however, it was grabbing different hashes for hashedFiles with 100% identical values. – MrFreeman Oct 08 '13 at 01:57

1 Answer

If you want the type to handle its own comparisons in Except, the interface you need is IEquatable<T>. The IEqualityComparer<T> interface lets a separate type handle the comparisons, so that it can be passed to the Except overload that accepts a comparer.

This achieves what you want (assuming you wanted both file string and hash compared).

public struct hashedFile : IEquatable<hashedFile>
{
    string _fileString;
    byte[] _fileHash;

    public hashedFile(string fileString, byte[] fileHash)
    {
        this._fileString = fileString;
        this._fileHash = fileHash;
    }

    public string FileString { get { return _fileString; } }
    public byte[] FileHash { get { return _fileHash; } }

    public bool Equals(hashedFile other)
    {
        return _fileString == other._fileString && _fileHash.SequenceEqual(other._fileHash);
    }
}

Here is an example in a working console application.

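using System;
using System.Collections.Generic;
using System.Linq; // needed for Except, ToList and SequenceEqual

public class Program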
{
    public struct hashedFile : IEquatable<hashedFile>
    {
        string _fileString;
        byte[] _fileHash;

        public hashedFile(string fileString, byte[] fileHash)
        {
            this._fileString = fileString;
            this._fileHash = fileHash;
        }

        public string FileString { get { return _fileString; } }
        public byte[] FileHash { get { return _fileHash; } }

        public bool Equals(hashedFile other)
        {
            return _fileString == other._fileString && _fileHash.SequenceEqual(other._fileHash);
        }
    }

    public static void Main(string[] args)
    {
        List<hashedFile> list1 = GetList1();
        List<hashedFile> list2 = GetList2();
        List<hashedFile> diff = list1.Except(list2).ToList();

        foreach (hashedFile h in diff)
        {
            Console.WriteLine(h.FileString + Environment.NewLine + h.FileHash[0].ToString("x2"));
        }

        Console.ReadLine();
    }

    private static List<hashedFile> GetList1()
    {
        hashedFile one = new hashedFile("test1", BitConverter.GetBytes(1));
        hashedFile two = new hashedFile("test2", BitConverter.GetBytes(2));
        hashedFile threeA = new hashedFile("test3", BitConverter.GetBytes(4));
        hashedFile four = new hashedFile("test4", BitConverter.GetBytes(4));

        var list1 = new List<hashedFile>();
        list1.Add(one);
        list1.Add(two);
        list1.Add(threeA);
        list1.Add(four);
        return list1;
    }

    private static List<hashedFile> GetList2()
    {
        hashedFile one = new hashedFile("test1", BitConverter.GetBytes(1));
        hashedFile two = new hashedFile("test2", BitConverter.GetBytes(2));
        hashedFile three = new hashedFile("test3", BitConverter.GetBytes(3));

        var list2 = new List<hashedFile>();
        list2.Add(one);
        list2.Add(two);
        list2.Add(three);
        return list2;
    }
}

This is getting long, but there is a problem with the implementation above if hashedFile is a class rather than a struct (and sometimes even when it is a struct, which may be version dependent). Except uses an internal Set class, and the problematic part is that it compares the hash codes first and uses the comparer to check equality only if they match.

int hashCode = this.InternalGetHashCode(value);
for (int i = this.buckets[hashCode % this.buckets.Length] - 1; i >= 0; i = this.slots[i].next)
{
    if ((this.slots[i].hashCode == hashCode) && this.comparer.Equals(this.slots[i].value, value))
    {
        return true;
    }
}

The fix for this, depending on your performance requirements, is to simply return a constant hash code of 0. Every item then has the same hash code, so the comparer's Equals is always used.

public override int GetHashCode()
{
    return 0;
}

The other option is to generate a proper hash code. This matters sooner than I expected: the difference for 500 items is 7ms vs 1ms, and for 5000 items it is 650ms vs 13ms, so it is probably best to go with a proper hash code. The byte array hash code function is based on https://stackoverflow.com/a/7244316/1002621

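public override int GetHashCode()
{
    // Hash the same data that Equals compares, so equal values always share a hash code.
    uint hashCode = 0;
    // Concat keeps every byte; Union would silently drop repeated byte values.
    var bytes = _fileHash.Concat(Encoding.UTF8.GetBytes(_fileString)).ToArray();
    for (var i = 0; i < bytes.Length; i++)
        hashCode = ((hashCode << 3) | (hashCode >> 29)) ^ bytes[i]; // Rotate left by 3 bits, then XOR the new value.
    return unchecked((int)hashCode);
}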
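
For completeness, here is a minimal sketch of the other route mentioned at the top of this answer: a separate IEqualityComparer<hashedFile> passed to Except. The comparer in the question's EDIT cannot work as written, because `x.FileHash == y.FileHash` and `FileHash.GetHashCode()` both operate on the array reference rather than its contents, which would explain different hash codes for seemingly identical values. The class name `HashedFileContentComparer`, the `Demo` class and the sample data below are made up for illustration; the struct is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct hashedFile
{
    readonly string _fileString;
    readonly byte[] _fileHash;

    public hashedFile(string fileString, byte[] fileHash)
    {
        _fileString = fileString;
        _fileHash = fileHash;
    }

    public string FileString { get { return _fileString; } }
    public byte[] FileHash { get { return _fileHash; } }
}

// Hypothetical comparer: compares and hashes the array contents, not the array reference.
public class HashedFileContentComparer : IEqualityComparer<hashedFile>
{
    public bool Equals(hashedFile x, hashedFile y)
    {
        if (x.FileString != y.FileString) return false;
        // If either array is missing, they are equal only when both are missing.
        if (x.FileHash == null || y.FileHash == null) return x.FileHash == y.FileHash;
        return x.FileHash.SequenceEqual(y.FileHash);
    }

    public int GetHashCode(hashedFile file)
    {
        unchecked
        {
            // Built from the same fields Equals uses, so equal values get equal hash codes.
            int hash = file.FileString == null ? 0 : file.FileString.GetHashCode();
            if (file.FileHash != null)
                foreach (byte b in file.FileHash)
                    hash = hash * 31 + b;
            return hash;
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample data for illustration only; each list gets its own array instances.
        var newList = new List<hashedFile>
        {
            new hashedFile("a", new byte[] { 1, 2 }),
            new hashedFile("b", new byte[] { 3 })
        };
        var oldList = new List<hashedFile>
        {
            new hashedFile("a", new byte[] { 1, 2 })
        };

        List<hashedFile> diff = newList.Except(oldList, new HashedFileContentComparer()).ToList();

        foreach (hashedFile h in diff)
            Console.WriteLine(h.FileString); // prints "b"
    }
}
```

Whether you put the logic on the struct (IEquatable plus GetHashCode) or in a separate comparer is mostly a matter of taste; the important part is that Equals and GetHashCode look at the same data.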
David Ewen
  • Thanks for the suggestion, and yes I do want the file string and hash compared, however running this new struct with list1.Except(list2), the public bool Equals is never called, and the list returned is just list1 again. – MrFreeman Oct 07 '13 at 03:23
  • I am going to edit the answer to include a full console example that works on my machine; maybe you can spot something I did that you missed. – David Ewen Oct 07 '13 at 03:46
  • How odd... using your example, the Equals method gets called when I use the Except method... but with my two custom lists of a string and an array of 16 bytes, the Equals method is never called... – MrFreeman Oct 07 '13 at 04:02
  • Hmmm, it looks like this might factor into my problem here [link](http://stackoverflow.com/a/1658166/612432) – MrFreeman Oct 07 '13 at 04:34
  • Seems that is related. I just did a quick test: if I change hashedFile to a class, Equals doesn't get hit, but it does as a struct. What is yours defined as? Your example has a struct. – David Ewen Oct 07 '13 at 04:42
  • Like my example, I am using a struct, and not a class. I find it incredibly odd that everything is seemingly correct but the behavior is different... – MrFreeman Oct 07 '13 at 04:54
  • Maybe it is framework dependent... Either way, I have updated the answer with the reason why this is happening and a workaround. – David Ewen Oct 07 '13 at 05:04
  • Ah ha! That seems to have forced the Except method to use our new Equals... however, for some reason, it is still returning list1 in its entirety. I am on .NET Framework 4.5. – MrFreeman Oct 07 '13 at 05:13
  • Try with the proper hash code I just added; after that I am out of ideas :). I have done my testing on .NET Framework 4. – David Ewen Oct 07 '13 at 05:23
  • I don't even understand this at this point. I have changed to the new proper hash code and changed the target framework to 4, and it seems to be doing things correctly, but it refuses to do anything but regurgitate list1... I know that the Except method can also take an IEqualityComparer, but I'm not sure creating one for this would even work at this point... – MrFreeman Oct 07 '13 at 05:51
  • As a sort of test, I inserted 'if(_fileString == other._fileString) MessageBox.Show("YES!");' into the overloaded Equals, and for some strange reason, there is never a hit... – MrFreeman Oct 08 '13 at 05:59
  • After hours, and HOURS of debugging, I found that while the fileString is common to list1 and list2 in my class which generates them, I was passing an incorrect variable to list2, so the two objects would never be equal. I made some corrections to my code, but without this answer's implementation of Equals and GetHashCode, this still would never have worked. Thanks much David! – MrFreeman Oct 08 '13 at 06:25