Linq join two lists: is it more efficient to use Dictionary?

Question

Final rephrase Below I join two sequences and I wondered if it would be faster to create a Dictionary of one sequence with the keySelector of the join as key and iterate through the other collection and find the key in the dictionary.

This only works if the key selector is unique. A real join has no problem with two records having the same key. In a dictionary you'll have to have unique keys

I measured the difference, and I noticed that the dictionary method is about 13% faster. In most use cases ignorable. See my answer to this question

Rephrased question

Some suggested that this question is the same question as LINQ - Using where or join - Performance difference?, but this one is not about using where or join, but about using a Dictionary to perform the join.

My question is: if I want to join two sequences based on a key selector, which method would be faster?

Put all items of one sequence in a Dictionary and enumerate the other sequence to see if the item is in the Dictionary. This would mean to iterate through both sequences once and calculate hash codes on the keySelector for every item in both sequences once.
The other method: use System.Enumerable.Join.

The question is: Would Enumerable.Join for each element in the first list iterate through the elements in the second list to find a match according to the key selector, having to compare N * N elements (is this called second order?) or would it use a more advanced method?

Original question with examples

I have two classes, both with a property Reference. I have two sequences of these classes and I want to join them based on equal Reference.

Class ClassA
{
     public string Reference {get;}
     ...
}

public ClassB
{
     public string Reference {get;}
     ...
}

var listA = new List<ClassA>()
{
    new ClassA() {Reference = 1, ...},
    new ClassA() {Reference = 2, ...},
    new ClassA() {Reference = 3, ...},
    new ClassA() {Reference = 4, ...},
}

var listB = new List<ClassB>()
{
    new ClassB() {Reference = 1, ...},
    new ClassB() {Reference = 3, ...},
    new ClassB() {Reference = 5, ...},
    new ClassB() {Reference = 7, ...},
}

After the join I want combinations of ClassA objects and ClassB objects that have an equal Reference. This is quite simple to do:

var myJoin = listA.Join(listB,           // join listA and listB
    a => a.Reference,                    // from listA take Reference
    b => b.Reference,                    // from listB take Reference
    (objectA, objectB) =>                // if references equal
        new {A = objectA, B = objectB}); // return combination

I'm not sure how this works, but I can imagine that for each a in listA the listB is iterated to see if there is a b in listB with the same reference as A.

Question: if I know that the references are Distinct wouldn't it be more efficient to convert B into a Dictionary and compare the Reference for each element in listA:

var dictB = listB.ToDictionary<string, ClassB>()
var myJoin = listA
    .Where(a => dictB.ContainsKey(a.Reference))
    .Select(a => new (A = a, B = dictB[a.Reference]);

This way, every element of listB has to be accessed once to put in the dictionary and every element of listA has to be accessed once, and the hascode of Reference has to be calculated once.

Would this method be faster for large collections?

`Enumerable.Join` [also uses a set](http://stackoverflow.com/questions/5551264/why-is-linq-join-so-much-faster-than-linking-with-where) under the hood. — Tim Schmelter, Aug 26 '15 at 14:42
possible duplicate of [LINQ - Using where or join - Performance difference?](http://stackoverflow.com/questions/3014123/linq-using-where-or-join-performance-difference) — Ham3d, Aug 26 '15 at 14:53
Yest Tim you are right. Otherwise I can't explain the small performance difference between dictionary and Enumerable.Join. This shows the answer to the related question: don't use Enumerable.Where if you want to Join. — Harald Coppoolse, Aug 27 '15 at 09:08

score 0 · Accepted Answer · answered Aug 27 '15 at 09:02

I created a test program for this and measured the time it took.

Suppose I have a class of Person, each person has a name and a Father property which is of type Person. If the Father is not know, the Father property is null

I have a sequence of Bastards (no father) that have exactly one Son and One Daughter. All Daughters are put in one sequence. All sons are put in another sequences.

The query: join the sons and the daughters that have the same father.

Results: Joining 1 million families using Enumerable.Join took 1.169 sec. Joining them using Dictionary join used 1.024 sec. Ever so slightly faster.

The code:

class Person : IEquatable<Person>
{
    public string Name { get; set; }
    public Person Father { get; set; }
    // + a lot of equality functions get hash code etc
    // for those interested: see the bottom
}

const int nrOfBastards = 1000000; // one million
var bastards = Enumerable.Range (0, nrOfBastards)
    .Select(i => new Person()
        { Name = 'B' + i.ToString(), Father = null })
    .ToList();

var sons = bastards.Select(father => new Person()
        {Name = "Son of " + father.Name, Father = father})
    .ToList();

var daughters = bastards.Select(father => new Person()
        {Name = "Daughter of " + father.Name, Father = father})
    .ToList();

// join on same parent: Traditionally and using Dictionary
var stopwatch = Stopwatch.StartNew();
this.TraditionalJoin(sons, daughters);
var time = stopwatch.Elapsed;
Console.WriteLine("Traditional join of {0} sons and daughters took {1:F3} sec", nrOfBastards, time.TotalSeconds);

stopwatch.Restart();
this.DictionaryJoin(sons, daughters);
time = stopwatch.Elapsed;
Console.WriteLine("Dictionary join of {0} sons and daughters took {1:F3} sec", nrOfBastards, time.TotalSeconds);
}

private void TraditionalJoin(IEnumerable<Person> boys, IEnumerable<Person> girls)
{   // join on same parent
    var family = boys
        .Join(girls,
            boy => boy.Father,
            girl => girl.Father,
            (boy, girl) => new { Son = boy.Name, Daughter = girl.Name })
        .ToList();
}

private void DictionaryJoin(IEnumerable<Person> sons, IEnumerable<Person> daughters)
{
    var sonsDictionary = sons.ToDictionary(son => son.Father);
    var family = daughters
        .Where(daughter => sonsDictionary.ContainsKey(daughter.Father))
        .Select(daughter => new { Son = sonsDictionary[daughter.Father], Daughter = daughter })
        .ToList();
}

For those interested in the equality of Persons, needed for a proper dictionary:

class Person : IEquatable<Person>
{
    public string Name { get; set; }
    public Person Father { get; set; }

    public bool Equals(Person other)
    {
        if (other == null)
            return false;
        else if (Object.ReferenceEquals(this, other))
            return true;
        else if (this.GetType() != other.GetType())
            return false;
        else
            return String.Equals(this.Name, other.Name, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(object obj)
    {
        return this.Equals(obj as Person);
    }

    public override int GetHashCode()
    {
        const int prime1 = 899811277;
        const int prime2 = 472883293;
        int hash = prime1;
        unchecked
        {
            hash = hash * prime2 + this.Name.GetHashCode();
            if (this.Father != null)
            {
                hash = hash * prime2 + this.Father.GetHashCode();
            }
        }
        return hash;
    }

    public override string ToString()
    {
        return this.Name;
    }

    public static bool operator==(Person x, Person y)
    {
        if (Object.ReferenceEquals(x, null))
            return Object.ReferenceEquals(y, null);
        else
            return x.Equals(y);
    }

    public static bool operator!=(Person x, Person y)
    {
        return !(x==y);
    }
}

Linq join two lists: is it more efficient to use Dictionary?

1 Answers1