Intersection() and Except() is too slow with large collections of custom objects

Question

I am importing data from another database.

My process is importing data from a remote DB into a List<DataModel> named remoteData and also importing data from the local DB into a List<DataModel> named localData.

I am then using LINQ to create a list of records that are different so that I can update the local DB to match the data pulled from remote DB. Like this:

var outdatedData = this.localData.Intersect(this.remoteData, new OutdatedDataComparer()).ToList();

I am then using LINQ to create a list of records that no longer exist in remoteData, but do exist in localData, so that I delete them from local database.

Like this:

var oldData = this.localData.Except(this.remoteData, new MatchingDataComparer()).ToList();

I am then using LINQ to do the opposite of the above to add the new data to the local database.

Like this:

var newData = this.remoteData.Except(this.localData, new MatchingDataComparer()).ToList();

Each collection imports about 70k records, and each of the 3 LINQ operation take between 5 - 10 minutes to complete. How can I make this faster?

Here is the object the collections are using:

internal class DataModel
{
        public string Key1{ get; set; }
        public string Key2{ get; set; }

        public string Value1{ get; set; }
        public string Value2{ get; set; }
        public byte? Value3{ get; set; }
}

The comparer used to check for outdated records:

class OutdatedDataComparer : IEqualityComparer<DataModel>
{
    public bool Equals(DataModel x, DataModel y)
    {
        var e =
            string.Equals(x.Key1, y.Key1) &&
            string.Equals(x.Key2, y.Key2) && (
                !string.Equals(x.Value1, y.Value1) ||
                !string.Equals(x.Value2, y.Value2) ||
                x.Value3 != y.Value3
                );
        return e;
    }

    public int GetHashCode(DataModel obj)
    {
        return 0;
    }
}

The comparer used to find old and new records:

internal class MatchingDataComparer : IEqualityComparer<DataModel>
{
    public bool Equals(DataModel x, DataModel y)
    {
        return string.Equals(x.Key1, y.Key1) && string.Equals(x.Key2, y.Key2);
    }

    public int GetHashCode(DataModel obj)
    {
        return 0;
    }
}

Hash codes are used to locate objects in hash tables, which is probably what `Except` and `Intersect` use internally to find matching objects. By returning a constant value, all objects will have the same location, and the search for matches degrades to a linear search among all candidates. You need to implement `GetHashCode` properly based on the properties used for equality. — Lee, Nov 01 '13 at 20:59
Correct. I added a hash code and the operation takes a split second! Thank you. — Theo, Nov 01 '13 at 21:55

Scott Chamberlain · Accepted Answer · 2013-11-01T22:05:59.113

By having a constant hash code you have ruined the performance. Here is the internal code Intersect uses (gotten via a decompiler)

public static IEnumerable<TSource> Intersect<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw Error.ArgumentNull("first");
    }
    if (second == null)
    {
        throw Error.ArgumentNull("second");
    }
    return Enumerable.IntersectIterator<TSource>(first, second, comparer);
}

private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource current in second)
    {
        set.Add(current);
    }
    foreach (TSource current2 in first)
    {
        if (set.Remove(current2))
        {
            yield return current2;
        }
    }
    yield break;
}

See that it is using a Set internally, if you implemented the hashcode would greatly improve it's performance.

MatchingDataCompaer is the easier of the two so I will do that one for you.

internal class MatchingDataComparer : IEqualityComparer<DataModel>
{
    public MatchingDataComparer()
    {
        comparer = StringComparer.Ordnal; //Use whatever comparer you want.
    }

    private readonly StringComparer comparer;

    public bool Equals(DataModel x, DataModel y)
    {
        return comparer.Equals(x.Key1, y.Key1) && comparer.Equals(x.Key2, y.Key2);
    }

    //Based off of the advice from http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode
    public int GetHashCode(DataModel obj)
    {    
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            hash = hash * 23 + comparer.GetHashCode(obj.Key1);
            hash = hash * 23 + comparer.GetHashCode(obj.Key2);
            return hash;
        }
    }
}

You could potentially use the hashcode function from MatchingDataComparer in OutdatedDataComparer, it may not be the "optimial" hash code¹, but it would would be a "legal"² one, and would be a lot faster than a hard coded 0.

^{1. Or it may be, I am not sure how I would include that 3rd && condition

2. If a.Equals(b) == true then a.GetHashCode() == b.GetHashCode().

If a.Equals(b) == false then a.GetHashCode() == b.GetHashCode() || a.GetHashCode() != b.GetHashCode()}

Thank you for the detailed explanation Scott! I implemented the hash code and the operation is fast now. — Theo, Nov 01 '13 at 21:58

Intersection() and Except() is too slow with large collections of custom objects

1 Answers1