15

Situation: Say we are executing a LINQ query that joins two in-memory lists (so no DbSets or SQL-query generation involved) and this query also has a where clause. This where only filters on properties included in the original set (the from part of the query).

Question: Does the linq query interpreter optimize this query in that it first executes the where before it performs the join, regardless of whether I write the where before or after the join? – so it does not have to perform a join on elements that are not included later anyways.

Example: For example, I have a categories list I want to join with a products list. However, I am just interested in the category with ID 1. Does the linq interpreter internally perform the exact same operations regardless of whether I write:

from category in categories
join prod in products on category.ID equals prod.CategoryID
where category.ID == 1 // <------ below join
select new { Category = category.Name, Product = prod.Name };

or

from category in categories
where category.ID == 1 // <------ above join
join prod in products on category.ID equals prod.CategoryID
select new { Category = category.Name, Product = prod.Name };

Previous research: I already saw this question but the OP author stated that his/her question is only targeting non-in-memory cases with generated SQL. I am explicitly interested with LINQ executing a join on two lists in-memory.

Update: This is not a dublicate of "Order execution of chain linq query" question as the referenced question clearly refers to a dbset and my question explicitly addressed a non-db scenario. (Moreover, although similar, I am not asking about inclusions based on navigational properties here but about "joins".)

Update2: Although very similar, this is also not a dublicate of "Is order of the predicate important when using LINQ?" as I am asking explicitly about in-memory situations and I cannot see the referenced question explicitly addressing this case. Moreover, the question is a bit old and I am actually interested in linq in the context of .NET Core (which didn't exist in 2012), so I updated the tag of this question to reflect this second point.

Please note: With this question I am aiming at whether the linq query interpreter somehow optimizes this query in the background and am hoping to get a reference to a piece of documentation or source code that shows how this is done by linq. I am not interested in answers such as "it does not matter because the performance of both queries is roughly the same".

Felix K.
  • 14,171
  • 9
  • 58
  • 72
  • 1
    Yes it does. There's no interpreter. A LINQ to Objects query is executed as is, it's not translated to something else. `Where()` is an iterator that loops over the input and returns any item that matches the predicate. You can check the source code directly to see how it's implemented, for [the full framework](https://referencesource.microsoft.com/#system.core/system/linq/Enumerable.cs,44b8532e11187695) and [.NET Core](https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/Where.cs) – Panagiotis Kanavos Dec 18 '18 at 11:29
  • If you want tolerable performance you *shouldn't* join in-memory lists like that. You'll be making M*N comparisons. You should create dictionaries or hashsets to find entries with common keys – Panagiotis Kanavos Dec 18 '18 at 11:31
  • Possible duplicate of [Order execution of chain linq query](https://stackoverflow.com/questions/17853363/order-execution-of-chain-linq-query) – SᴇM Dec 18 '18 at 11:47
  • Also [Is order of the predicate important when using LINQ?](https://stackoverflow.com/questions/9436539/is-order-of-the-predicate-important-when-using-linq). – SᴇM Dec 18 '18 at 11:49
  • @SeM please see my points which state why this is not a dublicate. Your second link comes quite close but does not explicitly refer to in-memory joins. However your comment convinced me to sharpen my question in stating that I am interested in linq in the context of dotnet core. – Felix K. Dec 18 '18 at 12:15
  • @B12Toaster Well, I've shouldn't mark it as duplicate, it was only in answer which was talking about Linq 2 Object, so I've removed flag. – SᴇM Dec 18 '18 at 12:25
  • @B12Toaster as the answer stated in first link, **_"In other LINQ queries, LINQ-to-Objects, order can matter greatly as the query is not re-optimized the way SQL is and is simply processed from top to bottom"_** which I guess answers to your question **_"Does “where” position in LINQ query matter when joining in-memory?"_**. Yes it does, cause executing something on filtered result (in your case joining) is doing less work, than executing on everything, then filtering. – SᴇM Dec 18 '18 at 12:34

2 Answers2

9

The LINQ query syntax will be compiled to a method chain. For details, read e.g. in this question.

The first LINQ query will be compiled to the following method chain:

categories
    .Join(
        products,
        category => category.ID,
        prod => prod.CategoryID,
        (category, prod) => new { category, prod })
    .Where(t => t.category.ID == 1)
    .Select(t => new { Category = t.category.Name, Product = t.prod.Name });

The second one:

categories
    .Where(category => category.ID == 1)
    .Join(
        products,
        category => category.ID,
        prod => prod.CategoryID,
        (category, prod) => new { Category = category.Name, Product = prod.Name });

As you can see, the second query will cause less allocations (note only one anonymous type vs 2 in the first query, and note how many instances of those anonymous types will be created on performing the query).

Furthermore, it's clear that the first query will perform a join operation on lot more data than the second (already filtered) one.

There will be no additional query optimization in case of LINQ-to-objects queries.

So the second version is preferable.

dymanoid
  • 14,771
  • 4
  • 36
  • 64
  • 3
    It's not the number of anonymous types that's important here IMO - it's that in the second case we're joining on less data to start with. – Jon Skeet Dec 18 '18 at 12:20
  • @JonSkeet, yes, but that's pretty obvious. In the first case we not only join on more data, but also cause a lot more memory pressure. – dymanoid Dec 18 '18 at 12:23
  • 2
    You may think it's obvious, but I don't think that's necessarily obvious to the OP. (Or rather, the OP may well be wondering whether LINQ to Objects would perform that optimization automatically.) – Jon Skeet Dec 18 '18 at 12:25
  • 1
    @JonSkeet, okay, thanks for pointing that out. I updated my answer. – dymanoid Dec 18 '18 at 12:27
3

For in memory lists (IEnumerables), no optimization is applied and query execution is made in chained order for in-memory lists.

I also tried result by first casting it to IQueryable then apply filtering but apparently casting time is pretty high for this big table.

I made a quick test for this case.

Console.WriteLine($"List Row Count = {list.Count()}"); 
Console.WriteLine($"JoinList Row Count = {joinList.Count()}"); 

var watch = Stopwatch.StartNew();
var result = list.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
   .Where(t => t.inner.Prop3 == "Prop13")
   .Select(t => new { t.inner.Prop4, t.lst.Prop2}); 
result.Dump();
watch.Stop();

Console.WriteLine($"Result1 Elapsed = {watch.ElapsedTicks}");

watch.Restart();
var result2 = list
   .Where(t => t.Prop3 == "Prop13")
   .Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
   .Select(t => new { t.inner.Prop4, t.lst.Prop2});

result2.Dump();
watch.Stop();
Console.WriteLine($"Result2 Elapsed = {watch.ElapsedTicks}"); 

watch.Restart();
var result3 = list.AsQueryable().Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
   .Where(t => t.inner.Prop3 == "Prop13")
   .Select(t => new { t.inner.Prop4, t.lst.Prop2}); 
result3.Dump();
watch.Stop();
Console.WriteLine($"Result3 Elapsed = {watch.ElapsedTicks}"); 

Findings:

List Count = 100
JoinList Count = 10
Result1 Elapsed = 27
Result2 Elapsed = 17
Result3 Elapsed = 591

List Count = 1000
JoinList Count = 10
Result1 Elapsed = 20
Result2 Elapsed = 12
Result3 Elapsed = 586

List Count = 100000
JoinList Count = 10
Result1 Elapsed = 603
Result2 Elapsed = 19
Result3 Elapsed = 1277

List Count = 1000000
JoinList Count = 10
Result1 Elapsed = 1469
Result2 Elapsed = 88
Result3 Elapsed = 3219
Derviş Kayımbaşıoğlu
  • 28,492
  • 4
  • 50
  • 72
  • Thanks for your tests. I just had closer look at the implementation of [AsQueryable](https://github.com/dotnet/corefx/blob/master/src/System.Linq.Queryable/src/System/Linq/Queryable.cs#L21) and it's related [docs entry](https://learn.microsoft.com/en-us/dotnet/api/system.linq.queryable.asqueryable?view=netcore-2.2#System_Linq_Queryable_AsQueryable_System_Collections_IEnumerable_): *"AsQueryable(IEnumerable) returns [...] an IQueryable that executes queries by calling the equivalent query operator methods in Enumerable instead of those in Queryable."* ... – Felix K. Dec 18 '18 at 15:21
  • 1
    ...and it looks like when resolving/executing the `EnumerableQuery` eventually, it will simply execute the `Where` and `Join` in the order it was stated in the chained expression – so using `AsQueryable` seems not to provide an optimization benefit here. – Felix K. Dec 18 '18 at 15:21