32

I recently came across this question: what numbers will the following code print?

class Program
{
    static void Main(string[] args)
    {
        int[] numbers = { 1, 3, 5, 7, 9 };
        int threshold = 6;
        var query = from value in numbers where value >= threshold select value;

        threshold = 3;
        var result = query.ToList();

        result.ForEach(Console.WriteLine);
        Console.ReadLine();
    }
}

Answer: 3, 5, 7, 9

That surprised me. I thought the value of threshold would be put on the stack when the query was constructed and then, at execution time, that value would be read back and used in the condition. That is not what happened.

Another case (numbers is set to null just before execution):

    static void Main(string[] args)
    {
        int[] numbers = { 1, 3, 5, 7, 9 };
        int threshold = 6;
        var query = from value in numbers where value >= threshold select value;

        threshold = 3;
        numbers = null;
        var result = query.ToList();
        ...
    }

This seems to have no effect on the query. It prints exactly the same answer as the previous example.

Could anyone help me understand what is really going on behind the scenes? Why does changing threshold affect the query execution while changing numbers doesn't?

michal-mad
  • 432
  • 4
  • 11
  • 20
    It's easier to understand if you change from fluent notation to functional notation. `numbers.Where(value => (value >= threshold)).Select(value => value)`. Now you see that `threshold` is inside a lambda (therefore delay-evaluated) but `numbers` is not in a lambda (therefore immediately-evaluated). – Raymond Chen Nov 19 '17 at 16:53
  • 5
    This is a good reason not to use the LINQ query syntax. It hides what's actually going on from you, for little if any benefit. I don't even find it more aesthetically pleasing, personally. – jpmc26 Nov 19 '17 at 22:20
  • 3
    [This great blog by Jon Skeet](https://codeblog.jonskeet.uk/2010/09/03/reimplementing-linq-to-objects-part-2-quot-where-quot/) really helped me understand a lot about LINQ. – oerkelens Nov 20 '17 at 07:37
  • @RaymondChen why are you writing answer in a comment?? – Matsemann Nov 20 '17 at 09:46
  • @Matsemann because it is not a full answer. See full answers below. – Raymond Chen Nov 20 '17 at 15:48
  • @RaymondChen then what's the point? – Matsemann Nov 20 '17 at 21:36
  • @Matsemann To act as a starting point. – Raymond Chen Nov 21 '17 at 03:08
  • 1
    This article explains what's happening behind the scenes in .NET Framework and .NET 5: https://levelup.gitconnected.com/linq-behind-the-scenes-efd664d9ebf8?sk=cba7416407ec8b753d9961fe23aac173 – David Klempfner Apr 20 '21 at 11:22

5 Answers

31

Your query can be written like this in method syntax:

var query = numbers.Where(value => value >= threshold);

Or:

Func<int, bool> predicate = delegate(int value) {
    return value >= threshold;
};
IEnumerable<int> query = numbers.Where(predicate);

These pieces of code (including your own query in query syntax) are all equivalent.

When you unroll the query like that, you see that predicate is an anonymous method and threshold is a variable captured by that method (the method forms a closure over it). That means the method sees whatever value threshold has at the time it executes. The compiler generates an actual (non-anonymous) method that takes care of that. The method is not executed when it's declared, but once for each item when query is enumerated (the execution is deferred). Since the enumeration happens after the value of threshold is changed (and threshold is captured, not copied), the new value is used.

When you set numbers to null, you set the reference to nowhere, but the object still exists. The IEnumerable returned by Where (and referenced in query) still references it and it does not matter that the initial reference is null now.

That explains the behavior: numbers and threshold play different roles in the deferred execution. numbers is a reference to the array that is enumerated, while threshold is a local variable whose scope is "forwarded" to the anonymous method.
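If you actually want the snapshot behavior the question expected, one option is to let the lambda capture a separate variable that is never reassigned. The following is only a minimal sketch; the class name SnapshotDemo and the variable frozen are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Runs a query whose predicate captured a copy of the threshold.
    public static List<int> FilterWithSnapshot()
    {
        int[] numbers = { 1, 3, 5, 7, 9 };
        int threshold = 6;

        // The lambda captures 'frozen', which is never reassigned,
        // so later changes to 'threshold' cannot reach the query.
        int frozen = threshold;
        IEnumerable<int> query = numbers.Where(value => value >= frozen);

        threshold = 3;          // has no effect on the query
        return query.ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FilterWithSnapshot())); // prints "7, 9"
    }
}
```

Calling ToList() right where the query is built would have a similar effect, at the cost of giving up deferred execution.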

Extension, part 1: Modification of the closure during the enumeration

You can take your example one step further when you replace the line...

var result = query.ToList();

...with:

List<int> result = new List<int>();
foreach(int value in query) {
    threshold = 8;
    result.Add(value);
}

What you are doing here is changing the value of threshold while the array is being iterated. When you hit the body of the loop the first time (when value is 3), you change threshold to 8, which means the values 5 and 7 are skipped and the next value added to the list is 9. The reason is that threshold is read again on every iteration, and whatever value it holds at that moment is used. Since threshold is now 8, the numbers 5 and 7 no longer evaluate as greater than or equal to it.
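To make the order of events visible, here is a small sketch that records which threshold each element is compared against. The class name MidEnumerationDemo and the Trace list are assumptions added for illustration; the query itself is the one discussed above, with threshold already set to 3 before enumeration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MidEnumerationDemo
{
    // One entry per predicate call, in the order the calls happen.
    public static readonly List<string> Trace = new List<string>();

    public static List<int> Run()
    {
        Trace.Clear();
        int[] numbers = { 1, 3, 5, 7, 9 };
        int threshold = 3;

        IEnumerable<int> query = numbers.Where(value =>
        {
            // Record the threshold that is in effect for this element.
            Trace.Add($"{value} >= {threshold}");
            return value >= threshold;
        });

        var result = new List<int>();
        foreach (int value in query)
        {
            threshold = 8;      // changes the variable the predicate reads
            result.Add(value);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));   // prints "3, 9"
        Console.WriteLine(string.Join(" | ", Trace));  // prints "1 >= 3 | 3 >= 3 | 5 >= 8 | 7 >= 8 | 9 >= 8"
    }
}
```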

Extension, part 2: Entity Framework is different

To make things more complicated, when you use LINQ providers that create a different query from your original and then execute it, things are slightly different. The most common examples are Entity Framework (EF) and LINQ2SQL (now largely superseded by EF). These providers create an SQL query from the original query before the enumeration. Since this time the value of the closure is evaluated only once (it actually is not a closure, because the compiler generates an expression tree and not an anonymous method), changes in threshold during the enumeration have no effect on the result. These changes happen after the query is submitted to the database.
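The difference between a compiled anonymous method and an expression tree can be observed without a database. The sketch below only inspects the tree the compiler builds; it does not involve Entity Framework, and how a particular provider translates the tree is not shown here. The class name ExpressionTreeDemo is made up for illustration.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    public static string Describe()
    {
        int threshold = 6;

        // Assigning the lambda to Expression<...> makes the compiler emit a data
        // structure describing the code instead of a compiled method.
        Expression<Func<int, bool>> predicate = value => value >= threshold;

        // The body is a comparison node; its right-hand side is a member access
        // that refers to the captured variable.
        var comparison = (BinaryExpression)predicate.Body;
        return $"{comparison.NodeType}, right side: {comparison.Right.NodeType}";
    }

    public static void Main()
    {
        Console.WriteLine(Describe()); // prints "GreaterThanOrEqual, right side: MemberAccess"
    }
}
```

A provider walks a tree like this and decides when to read the captured value, which is why the timing can differ from LINQ to Objects.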

The lesson from this is that you always have to be aware of which flavor of LINQ you are using, and that some understanding of its inner workings is an advantage.

Sefe
  • 13,731
  • 5
  • 42
  • 55
  • Maybe worth adding something about iterators to also explain how lazy execution works (because `Where` is iterator). – Evk Nov 19 '17 at 20:24
  • @Evk: I added some details about the deferred execution. More information than that would complicate the answer unnecessarily. – Sefe Nov 19 '17 at 21:12
  • "When you set numbers to null," I would add that the IEnumerable in query continues to hold a reference to the actual array object, so the array is not garbage collected and can still be used. – Denis Sivtsov Feb 07 '23 at 13:42
2

The easiest way is to look at what the compiler generates. You can use this site: https://sharplab.io

using System.Linq;

public class MyClass
{
    public void MyMethod()
    {
        int[] numbers = { 1, 3, 5, 7, 9 };

        int threshold = 6;

        var query = from value in numbers where value >= threshold select value;

        threshold = 3;
        numbers = null;

        var result = query.ToList();
    }
}

And here is the output:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Security;
using System.Security.Permissions;

[assembly: AssemblyVersion("0.0.0.0")]
[assembly: Debuggable(DebuggableAttribute.DebuggingModes.Default | DebuggableAttribute.DebuggingModes.DisableOptimizations | DebuggableAttribute.DebuggingModes.IgnoreSymbolStoreSequencePoints | DebuggableAttribute.DebuggingModes.EnableEditAndContinue)]
[assembly: CompilationRelaxations(8)]
[assembly: RuntimeCompatibility(WrapNonExceptionThrows = true)]
[assembly: SecurityPermission(SecurityAction.RequestMinimum, SkipVerification = true)]
[module: UnverifiableCode]
public class MyClass
{
    [CompilerGenerated]
    private sealed class <>c__DisplayClass0_0
    {
        public int threshold;

        internal bool <MyMethod>b__0(int value)
        {
            return value >= this.threshold;
        }
    }

    public void MyMethod()
    {
        MyClass.<>c__DisplayClass0_0 <>c__DisplayClass0_ = new MyClass.<>c__DisplayClass0_0();
        int[] expr_0D = new int[5];
        RuntimeHelpers.InitializeArray(expr_0D, fieldof(<PrivateImplementationDetails>.D603F5B3D40E40D770E3887027E5A6617058C433).FieldHandle);
        int[] source = expr_0D;
        <>c__DisplayClass0_.threshold = 6;
        IEnumerable<int> source2 = source.Where(new Func<int, bool>(<>c__DisplayClass0_.<MyMethod>b__0));
        <>c__DisplayClass0_.threshold = 3;
        List<int> list = source2.ToList<int>();
    }
}
[CompilerGenerated]
internal sealed class <PrivateImplementationDetails>
{
    [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 20)]
    private struct __StaticArrayInitTypeSize=20
    {
    }

    internal static readonly <PrivateImplementationDetails>.__StaticArrayInitTypeSize=20 D603F5B3D40E40D770E3887027E5A6617058C433 = bytearray(1, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 7, 0, 0, 0, 9, 0, 0, 0);
}

As you can see, when you change the threshold variable, you are really changing a field in the auto-generated class. Because the query can be executed at any time, it cannot hold a reference to a variable that lives on the stack: when the method exits, threshold would be removed from the stack. So the compiler turns the local variable into a field of the same type in an auto-generated class.
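The generated names are hard to read, so here is a hand-written equivalent of the same idea. This is only a sketch: ThresholdHolder and IsAtLeastThreshold are hypothetical names standing in for the compiler-generated class and method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HandWrittenClosureDemo
{
    // Plays the role of the compiler-generated display class.
    private sealed class ThresholdHolder
    {
        public int Threshold;

        // Reads the field every time it is called.
        public bool IsAtLeastThreshold(int value)
        {
            return value >= Threshold;
        }
    }

    public static List<int> Run()
    {
        int[] numbers = { 1, 3, 5, 7, 9 };

        var holder = new ThresholdHolder { Threshold = 6 };
        IEnumerable<int> query =
            numbers.Where(new Func<int, bool>(holder.IsAtLeastThreshold));

        // "threshold = 3" in the original code is really this field write.
        holder.Threshold = 3;

        return query.ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // prints "3, 5, 7, 9"
    }
}
```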

And the second problem: why setting numbers to null still works (this is not visible in the code above).

When you use: source.Where it calls this extension method:

   public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) {
        if (source == null) throw Error.ArgumentNull("source");
        if (predicate == null) throw Error.ArgumentNull("predicate");
        if (source is Iterator<TSource>) return ((Iterator<TSource>)source).Where(predicate);
        if (source is TSource[]) return new WhereArrayIterator<TSource>((TSource[])source, predicate);
        if (source is List<TSource>) return new WhereListIterator<TSource>((List<TSource>)source, predicate);
        return new WhereEnumerableIterator<TSource>(source, predicate);
    }

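As you can see, it passes the source reference to an iterator's constructor. For an array that is the WhereArrayIterator branch; the general case is: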

WhereEnumerableIterator<TSource>(source, predicate);

And here is source code for where iterator:

    class WhereEnumerableIterator<TSource> : Iterator<TSource>
    {
        IEnumerable<TSource> source;
        Func<TSource, bool> predicate;
        IEnumerator<TSource> enumerator;

        public WhereEnumerableIterator(IEnumerable<TSource> source, Func<TSource, bool> predicate) {
            this.source = source;
            this.predicate = predicate;
        }

        public override Iterator<TSource> Clone() {
            return new WhereEnumerableIterator<TSource>(source, predicate);
        }

        public override void Dispose() {
            if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
            enumerator = null;
            base.Dispose();
        }

        public override bool MoveNext() {
            switch (state) {
                case 1:
                    enumerator = source.GetEnumerator();
                    state = 2;
                    goto case 2;
                case 2:
                    while (enumerator.MoveNext()) {
                        TSource item = enumerator.Current;
                        if (predicate(item)) {
                            current = item;
                            return true;
                        }
                    }
                    Dispose();
                    break;
            }
            return false;
        }

        public override IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector) {
            return new WhereSelectEnumerableIterator<TSource, TResult>(source, predicate, selector);
        }

        public override IEnumerable<TSource> Where(Func<TSource, bool> predicate) {
            return new WhereEnumerableIterator<TSource>(source, CombinePredicates(this.predicate, predicate));
        }
    }

So it simply keeps a reference to our source object in a private field.
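The same effect can be reproduced with a much smaller stand-in. SimpleWhere below is a hypothetical simplification written with yield, not the real Enumerable.Where; it is only meant to show that the object returned by the call keeps its own reference to the source, so clearing the caller's variable afterwards changes nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimpleWhereExtensions
{
    // Hypothetical stand-in for Enumerable.Where (no argument checks, no fast paths).
    public static IEnumerable<T> SimpleWhere<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        // 'source' and 'predicate' are stored in the returned object when the
        // method is called; the loop itself runs only during enumeration.
        foreach (T item in source)
        {
            if (predicate(item)) yield return item;
        }
    }
}

public static class SourceReferenceDemo
{
    public static List<int> Run()
    {
        int[] numbers = { 1, 3, 5, 7, 9 };
        IEnumerable<int> query = numbers.SimpleWhere(value => value >= 6);

        numbers = null;         // only the local variable is cleared
        return query.ToList();  // the query still reaches the array
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // prints "7, 9"
    }
}
```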

cs95
  • 379,657
  • 97
  • 704
  • 746
apocalypse
  • 5,764
  • 9
  • 47
  • 95
0

The variable "numbers" is the one the query was instantiated on and works with. The query retains the reference that numbers held when the query was created. The "threshold" variable, on the other hand, is used in the predicate when the query is executed, which happens in ToList(). At that point the predicate reads the current value of threshold.

Anyway, this is not clear code...

Stefano Liboni
  • 149
  • 2
  • 11
0

I think the easiest way to understand it is to go through it line by line and think about what is executed and when, as opposed to only declared in memory.

//this line declares the numbers array
 int[] numbers = { 1, 3, 5, 7, 9 };

//this one declares threshold and sets it to 6
 int threshold = 6;

//this line declares the query, which is not of type int[] but IEnumerable<int>; it is not executed at this point
//Creating the query does not iterate through numbers; it only stores a reference to the array together with the lambda.
 var query = from value in numbers where value >= threshold select value;

//this line changes the threshold value to 3
 threshold = 3;

//this line executes the query defined earlier and uses the current value of threshold, as the lambda only holds a reference to it
 var result = query.ToList();

 result.ForEach(Console.WriteLine);
 Console.ReadLine();

That mechanism gives you some nice features, like building a query in multiple places and executing it once everything is ready to go.
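As a rough illustration of building a query in several places, the sketch below adds operators conditionally and only runs the query at the end. The class name ComposedQueryDemo and the two flags are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ComposedQueryDemo
{
    public static List<int> Run(bool onlyLarge, bool descending)
    {
        int[] numbers = { 1, 3, 5, 7, 9 };

        // Nothing is filtered or sorted while the query is being assembled.
        IEnumerable<int> query = numbers;
        if (onlyLarge) query = query.Where(value => value >= 5);
        if (descending) query = query.OrderByDescending(value => value);

        // The whole pipeline executes here, once.
        return query.ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run(true, true)));   // prints "9, 7, 5"
        Console.WriteLine(string.Join(", ", Run(false, false))); // prints "1, 3, 5, 7, 9"
    }
}
```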

Setting the numbers variable to null won’t change the result, because the reference to the array was read immediately when the query was created, ready for later enumeration.

madoxdev
  • 3,770
  • 1
  • 24
  • 39
0

Your LINQ query does not return the requested data, it returns the possibility to get something that can access the elements of your data one by one.

In software terms: the value of your LINQ statement is an IEnumerable<T> (or IQueryable<T>, not further discussed here). This object does not hold your data. In fact, you can't do a lot with an IEnumerable<T>. The only thing it can do is produce another object that implements IEnumerator<T> (note the difference: IEnumerable vs IEnumerator). This `GetEnumerator()` function is the "get something that can access ..." part in my first sentence.

The object you got from IEnumerable<T>.GetEnumerator(), implements IEnumerator. This object also does not have to hold your data. It only knows how to produce the first element of your data (if there is one), and if it has got an element, it knows how to get the next element (if there is one). This is the "that can access the elements of your data one by one" from my first sentence.

So both the IEnumerable<T> and the IEnumerator<T> do not (have to) hold your data. They are only objects that help you to access your data in a defined order.

In the early days, when we didn't have List<T> or comparable collection classes that implemented IEnumerable<T>, it was quite a nuisance to implement IEnumerable<T> and the IEnumerator<T> members Reset, Current and MoveNext. In fact, nowadays it is hard to find examples of implementing IEnumerator<T> that do not use a class that also implements IEnumerator<T>.
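To give an idea of the amount of code involved, here is a sketch of a sequence written by hand, without yield and without delegating to another collection. Countdown is a made-up example class that produces start, start - 1, ..., 1.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sequence 'start, start - 1, ..., 1' implemented by hand.
public sealed class Countdown : IEnumerable<int>
{
    private readonly int _start;

    public Countdown(int start) { _start = start; }

    public IEnumerator<int> GetEnumerator() { return new CountdownEnumerator(_start); }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    private sealed class CountdownEnumerator : IEnumerator<int>
    {
        private readonly int _start;
        private int _current;

        public CountdownEnumerator(int start)
        {
            _start = start;
            _current = start + 1;   // positioned before the first element
        }

        public int Current { get { return _current; } }

        object IEnumerator.Current { get { return _current; } }

        public bool MoveNext()
        {
            if (_current <= 1) return false;   // nothing left to produce
            _current--;
            return true;
        }

        public void Reset() { _current = _start + 1; }

        public void Dispose() { }
    }
}

public static class CountdownDemo
{
    public static void Main()
    {
        Console.WriteLine(string.Join(", ", new Countdown(3))); // prints "3, 2, 1"
    }
}
```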

The introduction of the keyword yield eased the implementation of IEnumerable<T> and IEnumerator<T> a lot. A function that contains a yield return can simply be declared to return an IEnumerable<T>:

IEnumerable<double> GetMySpecialNumbers()
{   // returns the sequence: 0, 1, pi and e
    yield return 0.0;
    yield return 1.0;
    yield return 4.0 * Math.Atan(1.0);  // pi
    yield return Math.Exp(1.0);         // e
}

Note that I use the term sequence. It is not a List, not a Dictionary, you can only access the elements by asking for the first one, and repeatedly ask for the next one.
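A sequence produced with yield is also lazy: the body of the function does not run when the function is called, only when somebody asks for the next element. The sketch below records the order of events; LazyYieldDemo, Produce and the log list are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LazyYieldDemo
{
    private static readonly List<string> Log = new List<string>();

    private static IEnumerable<int> Produce()
    {
        Log.Add("producing 1");
        yield return 1;
        Log.Add("producing 2");
        yield return 2;
    }

    public static List<string> Run()
    {
        Log.Clear();

        // Calling the function only creates the sequence object.
        IEnumerable<int> sequence = Produce();
        Log.Add("after call");

        // The body of Produce runs piece by piece during the loop.
        foreach (int number in sequence)
        {
            Log.Add("consumed " + number);
        }
        return new List<string>(Log);
    }

    public static void Main()
    {
        // prints "after call, producing 1, consumed 1, producing 2, consumed 2"
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```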

You could access the elements of the sequence using IEnumerable<T>.GetEnumerator() and the three functions of IEnumerator<T>. This method is seldom used anymore:

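IEnumerable<double> myNumbers = GetMySpecialNumbers();
IEnumerator<double> enumerator = myNumbers.GetEnumerator();
// no Reset() here: a new enumerator already starts before the first element,
// and enumerators generated from yield throw NotSupportedException on Reset()

// while there are numbers, write the next one
while(enumerator.MoveNext())
{   // there is still an element in the sequence
    double valueToWrite = enumerator.Current;   // Current is a property, not a method
    Console.WriteLine(valueToWrite);
}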

With the introduction of foreach this has become much easier:

foreach (double valueToWrite in GetMySpecialNumbers())
    Console.WriteLine(valueToWrite);

Internally this calls GetEnumerator() and then uses MoveNext() and Current.

All collection classes like List<T>, Array, Dictionary<TKey, TValue>, Hashtable, etc. implement IEnumerable. Most of the time when a function returns an IEnumerable, you'll find that internally it uses one of these collection classes.

Another great invention after yield and foreach was the introduction of extension methods. See extension methods demystified.

Extension methods enable you to take a class that you can't change, like List<T> and write new functionality for it, using only the functions you have access to.

This was the boost for LINQ. It enabled us to write new functionality for everything that said: "hey, I'm a sequence, you can ask for my first element and for my next element" (= I implement IEnumerable).
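As a small illustration of the idea, the sketch below adds a new operation to every sequence without touching any existing class. EveryOther is a hypothetical extension method made up for this example; it is not part of LINQ.

```csharp
using System;
using System.Collections.Generic;

public static class SequenceExtensions
{
    // Returns the 1st, 3rd, 5th, ... element of any sequence.
    public static IEnumerable<T> EveryOther<T>(this IEnumerable<T> source)
    {
        bool take = true;
        foreach (T item in source)
        {
            if (take) yield return item;
            take = !take;
        }
    }
}

public static class EveryOtherDemo
{
    public static void Main()
    {
        int[] numbers = { 1, 3, 5, 7, 9 };
        var words = new List<string> { "a", "b", "c" };

        // The same method works on an array and on a List<T>,
        // because both implement IEnumerable<T>.
        Console.WriteLine(string.Join(", ", numbers.EveryOther())); // prints "1, 5, 9"
        Console.WriteLine(string.Join(", ", words.EveryOther()));   // prints "a, c"
    }
}
```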

If you look at the source code of LINQ, you'll find that LINQ functions like Where / Select / First / Reverse etc. are written as extension methods of IEnumerable. Most of them use generic collection classes (HashTable, Dictionary), some of them use yield return, and sometimes you'll even see the basic IEnumerator functions like Reset / MoveNext.

Quite often you'll write new functionality by concatenating LINQ functions. However, keep in mind that sometimes yield makes your function much easier to understand, and thus easier to reuse, debug and maintain.

Example: suppose you have a sequence of produced Products. Each Product has a DateTime property ProductCompletedTime that represents when production of that product completed.

Suppose you want to know how much time there is between two completed products. Problem: this can't be calculated for the first product.

With a yield this is easy:

public static IEnumerable<TimeSpan> ToProductionTimes
    (this IEnumerable<Product> products)
{
    // Product is an ordinary class here, not a generic type parameter,
    // so that ProductCompletedTime can be accessed
    var orderedProducts = products.OrderBy(product => product.ProductCompletedTime);
    Product previousProduct = orderedProducts.FirstOrDefault();
    foreach (Product product in orderedProducts.Skip(1))
    {
        yield return product.ProductCompletedTime - previousProduct.ProductCompletedTime;
        previousProduct = product;
    }
}

Try to do this in Linq, it will be much harder to understand what happens.
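For comparison, here is one possible way to express the same calculation with LINQ operators only, pairing each product with its successor through Zip. This is a sketch under assumptions: the Product class is a minimal stand-in with just the one property mentioned above, and ToProductionTimesWithZip is a name made up here. Readers can judge for themselves which version is easier to follow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the Product class described in the text.
public sealed class Product
{
    public DateTime ProductCompletedTime { get; set; }
}

public static class ProductZipExtensions
{
    public static IEnumerable<TimeSpan> ToProductionTimesWithZip(
        this IEnumerable<Product> products)
    {
        // Sort once and keep the result, so the sequence is not re-sorted
        // for each of the two enumerations below.
        List<Product> ordered = products
            .OrderBy(product => product.ProductCompletedTime)
            .ToList();

        // Zip pairs element i with element i + 1 and stops at the shorter sequence.
        return ordered.Zip(
            ordered.Skip(1),
            (previous, next) => next.ProductCompletedTime - previous.ProductCompletedTime);
    }
}

public static class ProductZipDemo
{
    public static void Main()
    {
        var start = new DateTime(2017, 11, 19, 8, 0, 0);
        var products = new[]
        {
            new Product { ProductCompletedTime = start.AddMinutes(25) },
            new Product { ProductCompletedTime = start },
            new Product { ProductCompletedTime = start.AddMinutes(10) },
        };

        foreach (TimeSpan gap in products.ToProductionTimesWithZip())
        {
            Console.WriteLine(gap.TotalMinutes); // prints 10, then 15
        }
    }
}
```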

Conclusion

An IEnumerable does not hold your data, it only holds the potential to access your data one by one.

The most used methods to access the data are foreach, ToList(), ToDictionary, First, etc.

Whenever you need to write a function that returns a difficult IEnumerable<T> at least consider writing a yield return function.

Harald Coppoolse
  • 28,834
  • 7
  • 67
  • 116