2

Could you tell me which one of those list joining methods is more efficient and why? Or is it the same performance-wise? Is there any other approach to this situation (joining 2 lists based on field property)?

My code:

public List<CombinedDatabaseStructure> Combine1(List<DatabaseStructure1> _Data1, List<DatabaseStructure2> _Data2)
{
    List<CombinedDatabaseStructure> combinedAll = new List<CombinedDatabaseStructure>();
    foreach (var row1 in _Data1)
    {
        foreach (var row2 in _Data2)
        {
            CombinedDatabaseStructure combined = new CombinedDatabaseStructure();
            if (row1.ID == row2.ID)
            {
                combined.DatabaseStructure1 = row1;
                combined.DatabaseStructure2 = row2;
                combinedAll.Add(combined);
            }
        }
    }

    return combinedAll;
}

Code2:

public List<CombinedDatabaseStructure> Combine2(List<DatabaseStructure1> _Data1, List<DatabaseStructure2> _Data2)
{
    var joined = from item1 in _Data1.AsEnumerable() 
                 join item2 in _Data2.AsEnumerable() on item1.ID equals item2.ID
                 select new CombinedDatabaseStructure (item1,item2);

    return  joined.ToList<CombinedDatabaseStructure>();
}
Heinzi
  • 167,459
  • 57
  • 363
  • 519
Piotr P
  • 67
  • 1
  • 2
  • 11
  • 3
    When you profiled it, what did you find? _Also, you will probably find you could remove your `.AsEnumerable()` code - it doesn't achieve anything._ – mjwills Jul 04 '17 at 14:33
  • Where is the list data coming from? e.g. is it coming from a database? – mjwills Jul 04 '17 at 14:34
  • The only way to prove which is the best is to test yourself, nobody else can tell you that. – DavidG Jul 04 '17 at 14:35
  • @mjwills yes, it's coming from DB – Piotr P Jul 04 '17 at 14:37
  • 1
    In that case, I would recommend getting the DB to do the join. It will be massively more efficient. – mjwills Jul 04 '17 at 14:37
  • @mjwills the thing is I'm getting data1 from SQL Database and data2 from Oracle Database and stored procedure approach (linked servers) is a no-go :/ – Piotr P Jul 04 '17 at 20:07

1 Answers1

11

As a general rule: If there's a built-in method in the .NET framework which does exactly what you want, it's usually a good idea to use it instead of re-implementing it. It's easier, more readable, less error-prone, better tested and usually more efficiently implemented.


Let's look at your particular problem in detail:

Option 1 is basically a manual (naive) implementation of a nested loop join with a complexity of O(n*m).

Option 2 uses LINQ-to-object's implementation of join, which internally uses a hash join, which has a complexity of O(n+m).

If you are worried about efficiency, I'd recommend "Option 3": let the database do the join. It can use statistics to choose the optimal join strategy for your data.


Note: Your nested loop implementation is very inefficient. It can be implemented with O(n*log(m)) complexity by using some kind of index lookup to find matching Data2 rows instead of the inner loop. In that case, a nested loop join can be faster than a hash join, if n is very small and m is large. This, however, assumes that the index already exists, since creating an index (for example, by creating a C# dictionary from your list) is a O(m) operation itself.

Heinzi
  • 167,459
  • 57
  • 363
  • 519