15

I have a large List in memory, from a class that has about 20 properties.

I'd like to filter this list based on just one property, for a particular task I only need a list of that property. So my query is something like:

data.Select(x => x.field).Where(x => x == "desired value").ToList()

Which one gives me a better performance, using Select first, or using Where?

data.Where(x => x.field == "desired value").Select(x => x.field).ToList()

Please let me know if this is related to the data type I'm keeping the data in memory, or field's type. Please note that I need these objects for other tasks too, so I can't filter them in the first place and before loading them into memory.

Cool Blue
  • 6,438
  • 6
  • 29
  • 68
Akbari
  • 2,369
  • 7
  • 45
  • 85
  • Seems to me SELECT/WHERE might be a little better than WHERE/SELECT because in the two-step process, the first step (SELECT) is more restrictive than WHERE, so the second step would have a smaller dataset to filter. This is just an observation, and I don't think the results would be much different until your List got to be huge. – Shannon Holsinger Aug 29 '16 at 03:34
  • Theoretically, one would think that the `Where` returns less than all the records (based on condition) while the `Select` returns all the records first then performs a conditional filtering on it. So it would seem that using `Where` first would give better performance. Nevertheless, you could try testing your queries out with a giant testing database of 10000+ records. Good Question! – Keyur PATEL Aug 29 '16 at 03:35
  • See? That's what I get for thinking! @Keyur PATEL seems to make more sense than I do. – Shannon Holsinger Aug 29 '16 at 03:38
  • 1
    When you have two horses and you want to know which is faster the way you do that is race them. You should do the same with both of these. See https://ericlippert.com/2012/12/17/performance-rant/ – Enigmativity Aug 29 '16 at 03:40
  • @Enigmativity: Horses idea will not work for this question. Quite simply, each unique state of the collection will represent a unique horse. Just too many horses to race. It's good however to link the article here. Will be helpful to the readers. – displayName Aug 29 '16 at 04:15
  • @displayName - Of course it works here. Otherwise this is just a theoretical exercise. – Enigmativity Aug 29 '16 at 04:17
  • @Enigmativity: Didn't get it yet. Why would it work here? For every unique state of the collection you get two horses (one with Select first, other with Where first). And the collection can take, practically too, millions of states. Can you really race twice of millions of horses? – displayName Aug 29 '16 at 04:20
  • @displayName - You can't know for all possible states, but you must run your horses at least once to get any practical idea of the speed difference. Not running them at all is just asinine. – Enigmativity Aug 29 '16 at 04:25
  • @Enigmativity: So you are saying that if practical states are in thousands then we should run thousands of horses? :D I think there's a better way (if my answer below is right). And that way is to keep track of collection's state when it is being changed. Though this would be subject to need of the application. – displayName Aug 29 '16 at 04:30
  • @displayName - Yes, ideally you should run thousands of races if **you feel there are many _different_ practical states**. – Enigmativity Aug 29 '16 at 05:07
  • We need to see the actual code, if you're keeping the data in memory what exactly are you storing? It's in a list so retrieved but when you access a property is it a property of the same table in the select or another property of a different table? In a normal case you'd always put where before select but if you're asking not out of curiosity but because your code is slow odds are you're accessing the DB on each select (which means if you have 10 000 rows you'rll be calling the DB 10 000 time instead of 1 time to retrieve all the data) – Ronan Thibaudau Feb 18 '17 at 01:52
  • When i say actual code i mean actual (not simplified for here) code for this query, as well as the actual code you use to construct the list you query (that is the code that actually query the database to create the list you're now querying) – Ronan Thibaudau Feb 18 '17 at 01:52

2 Answers2

15

Which one gives me a better performance, using Select first, or using Where.

Where first approach is more performant, since it filters your collection first, and then executes Select for filtered values only.

Mathematically speaking, Where-first approach takes N + N' operations, where N' is the number of collection items which fall under your Where condition.
So, it takes N + 0 = N operations at minimum (if no items pass this Where condition) and N + N = 2 * N operations at maximum (if all items pass the condition).

At the same time, Select first approach will always take exactly 2 * N operations, since it iterates through all objects to acquire the property, and then iterates through all objects to filter them.

Benchmark proof

I have completed the benchmark to prove my answer.

Results:

Condition value: 50
Where -> Select: 88 ms, 10500319 hits
Select -> Where: 137 ms, 20000000 hits

Condition value: 500
Where -> Select: 187 ms, 14999212 hits
Select -> Where: 238 ms, 20000000 hits

Condition value: 950
Where -> Select: 186 ms, 19500126 hits
Select -> Where: 402 ms, 20000000 hits

If you run the benchmark many times, then you will see that Where -> Select approach hits change from time to time, while Select -> Where approach always takes 2N operations.

IDEOne demonstration:

https://ideone.com/jwZJLt

Code:

class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

class Program
{
    static void Main()
    {
        var random = new Random();
        List<Point> points = Enumerable.Range(0, 10000000).Select(x => new Point { X = random.Next(1000), Y = random.Next(1000) }).ToList();

        int conditionValue = 250;
        Console.WriteLine($"Condition value: {conditionValue}");

        Stopwatch sw = new Stopwatch();
        sw.Start();

        int hitCount1 = 0;
        var points1 = points.Where(x =>
        {
            hitCount1++;
            return x.X < conditionValue;
        }).Select(x =>
        {
            hitCount1++;
            return x.Y;
        }).ToArray();

        sw.Stop();
        Console.WriteLine($"Where -> Select: {sw.ElapsedMilliseconds} ms, {hitCount1} hits");

        sw.Restart();

        int hitCount2 = 0;
        var points2 = points.Select(x =>
        {
            hitCount2++;
            return x.Y;
        }).Where(x =>
        {
            hitCount2++;
            return x < conditionValue;
        }).ToArray();

        sw.Stop();
        Console.WriteLine($"Select -> Where: {sw.ElapsedMilliseconds} ms, {hitCount2} hits");

        Console.ReadLine();
    }
}

Related questions

These questions can also be interesting to you. They are not related to Select and Where, but they are about LINQ order performance:

Does the order of LINQ functions matter?
Order of LINQ extension methods does not affect performance?

Community
  • 1
  • 1
Yeldar Kurmangaliyev
  • 33,467
  • 12
  • 59
  • 101
  • 1
    Might be true for the case you picked where `Point` is light weight with just two properties, but it may vary if we deal with large objects. – Hari Prasad Aug 29 '16 at 03:58
  • @HariPrasad I am not sure about how does the size of an object affects the execution time. Could you explain, please? – Yeldar Kurmangaliyev Aug 29 '16 at 03:59
  • @YeldarKurmangaliyev: I have added the test with a different object size. Can you take a look. Probably I have overlooked something. – displayName Aug 30 '16 at 00:42
  • Note that your 950 value example is a perfect example of the benchmark not working, you get about the same number of hits in both cases, so about the same amount of function calls, and one of the function takes less than half the time, showing the benchmark clearly fails even for linq to objects. To benchmark this there needs to be some warmup code as well as externalizing the test and running it thouthand of times skipping the first few runs and showing the min/max/average duration at the very least for each case – Ronan Thibaudau Feb 18 '17 at 01:47
5

The answer will depend on the state of your collection.

  • If most entities will pass the Where test, apply Select first;
  • If fewer entities will pass the Where test, apply Where first.

Update:

@YeldarKurmangaliyev wrote the answer with a concrete example and benchmarking. I ran similar code to verify his claim and our results are exactly opposite and that is because I ran the same test as his but with an object not as simple as the Point type he used to run his tests.

The code very much looks like his code, except that I changed the name of class from Point to EnumerableClass.

Given below the classes I used to constitute the EnumerableClass class:

public class EnumerableClass
{
    public int X { get; set; }
    public int Y { get; set; }
    public String A { get; set; }
    public String B { get; set; }
    public String C { get; set; }
    public String D { get; set; }
    public String E { get; set; }
    public Frame F { get; set; }
    public Gatorade Gatorade { get; set; }
    public Home Home { get; set; }
}

public class Home
{
    private Home(int rooms, double bathrooms, Stove stove, InternetConnection internetConnection)
    {
        Rooms = rooms;
        Bathrooms = (decimal) bathrooms;
        StoveType = stove;
        Internet = internetConnection;
    }

    public int Rooms { get; set; }
    public decimal Bathrooms { get; set; }
    public Stove StoveType { get; set; }
    public InternetConnection Internet { get; set; }

    public static Home GetUnitOfHome()
    {
        return new Home(5, 2.5, Stove.Gas, InternetConnection.Att);
    }
}

public enum InternetConnection
{
    Comcast = 0,
    Verizon = 1,
    Att = 2,
    Google = 3
}

public enum Stove
{
    Gas = 0,
    Electric = 1,
    Induction = 2
}

public class Gatorade
{
    private Gatorade(int volume, Color liquidColor, int bottleSize)
    {
        Volume = volume;
        LiquidColor = liquidColor;
        BottleSize = bottleSize;
    }

    public int Volume { get; set; }
    public Color LiquidColor { get; set; }
    public int BottleSize { get; set; }

    public static Gatorade GetGatoradeBottle()
    {
        return new Gatorade(100, Color.Orange, 150);
    }
}

public class Frame
{
    public int X { get; set; }
    public int Y { get; set; }

    private Frame(int x, int y)
    {
        X = x;
        Y = y;
    }

    public static Frame GetFrame()
    {
        return new Frame(5, 10);
    }
}

The classes Frame, Gatorade and Home have a static method each to return an instance of their type.

Below is the main program:

public static class Program
{
    const string Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static readonly Random Random = new Random();

    private static string RandomString(int length)
    {
        return new string(Enumerable.Repeat(Chars, length)
            .Select(s => s[Random.Next(s.Length)]).ToArray());
    }

    private static void Main()
    {
        var random = new Random();
        var largeCollection =
            Enumerable.Range(0, 1000000)
                .Select(
                    x =>
                        new EnumerableClass
                        {
                            A = RandomString(500),
                            B = RandomString(1000),
                            C = RandomString(100),
                            D = RandomString(256),
                            E = RandomString(1024),
                            F = Frame.GetFrame(),
                            Gatorade = Gatorade.GetGatoradeBottle(),
                            Home = Home.GetUnitOfHome(),
                            X = random.Next(1000),
                            Y = random.Next(1000)
                        })
                .ToList();

        const int conditionValue = 250;
        Console.WriteLine(@"Condition value: {0}", conditionValue);

        var sw = new Stopwatch();
        sw.Start();
        var firstWhere = largeCollection
            .Where(x => x.Y < conditionValue)
            .Select(x => x.Y)
            .ToArray();
        sw.Stop();
        Console.WriteLine(@"Where -> Select: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        var firstSelect = largeCollection
            .Select(x => x.Y)
            .Where(y => y < conditionValue)
            .ToArray();
        sw.Stop();
        Console.WriteLine(@"Select -> Where: {0} ms", sw.ElapsedMilliseconds);
        Console.ReadLine();

        Console.WriteLine();
        Console.WriteLine(@"First Where's first item: {0}", firstWhere.FirstOrDefault());
        Console.WriteLine(@"First Select's first item: {0}", firstSelect.FirstOrDefault());
        Console.WriteLine();
        Console.ReadLine();
    }
}

Results:

I ran the tests multiple times and found that

.Select().Where() performed better than .Where().Select().

when collection size is 1000000.


Here is the first test result where I forced every EnumerableClass object's Y value to be 5, so every item passed Where:

Condition value: 250
Where -> Select: 149 ms
Select -> Where: 115 ms

First Where's first item: 5
First Select's first item: 5

Here is the second test result where I forced every EnumerableClass object's Y value to be 251, so no item passed Where:

Condition value: 250
Where -> Select: 110 ms
Select -> Where: 100 ms

First Where's first item: 0
First Select's first item: 0

Clearly, the result is so dependent on the state of the collection that:

  • In @YeldarKurmangaliyev's tests .Where().Select() performed better; and,
  • In my tests .Select().Where() performed better.

The state of the collection, which I am mentioning over and over includes:

  • the size of each item;
  • the total number of items in the collection; and,
  • the number of items likely to pass the Where clause.

Response to comments on the answer:

Further, @Enigmativity said that knowing ahead of time the result of Where in order to know whether to put Where first or Select first is a Catch-22. Ideally and theoretically, he is correct and not surprisingly, this situation is seen in another domain of Computer Science - Scheduling.

The best scheduling algorithm is Shortest Job First where we schedule that job first that will execute for the least time. But, how would anyone know how much time will a particular job take to complete? Well, the answer is that:

Shortest job next is used in specialized environments where accurate estimates of running time are available.

Therefore, as I said right at the top (which was also the first, shorter version of my answer), the correct answer to this question will depend on the current state of the collection.

In general,

  • if your objects are within a reasonable size range; and,
  • you are Selecting a very small chunk out of each object; and,
  • your collection size is also not just in thousands,

then the guideline mentioned right at the top of this answer will be useful for you.

displayName
  • 13,888
  • 8
  • 60
  • 75
  • I can't tell beforehand about the passing items. If I was using `linq-to-sql` I would be always using where first. – Akbari Aug 29 '16 at 04:19
  • Can you explain your reasoning? – Enigmativity Aug 29 '16 at 04:22
  • @Akbari: I'm only guessing that for linq-to-sql it would not create a difference as the database would also optimize the query. – displayName Aug 29 '16 at 04:24
  • @Akbari: Read Enig and my comments on your question. I have mentioned what can you do about the case when the collection can be tracked. – displayName Aug 29 '16 at 04:32
  • @displayName - I think this answer needs some form of explanation. Right now it is arbitrary with very little science behind it. – Enigmativity Aug 29 '16 at 05:05
  • @displayName - Actually, thinking a little more about this answer, it is very impractical. It is saying that you need to know ahead of time what the result of a where clause is to know where to put the where clause. That's a catch-22. The **only** practical way is to get two horses and race them. – Enigmativity Aug 29 '16 at 05:08
  • @Enigmativity: Have tried to cover your concerns in my answer. Let me know what you think. I may still be wrong. – displayName Aug 29 '16 at 22:12
  • @HariPrasad: Your hunch, which you mentioned on the other answer here, is correct. I have covered it in my answer. – displayName Aug 29 '16 at 22:13
  • 1
    @Akbari: I have updated the answer to cover all that you have asked and others here have asked. If this isn't complete, let me know and I will try to add what's left. – displayName Aug 29 '16 at 22:15