5

Recently I have been researching some conventions in writing functions that return a collection. I was wondering whether the function that actually uses a List<int> should return a List<int> or rather IList<int>, ICollection<int> or IEnumerable<int>. I created some tests for performance and I was quite surprised with the results.

static List<int> list = MakeList();
static IList<int> iList = MakeList();
static ICollection<int> iCollection = MakeList();
static IEnumerable<int> iEnumerable = MakeList();

public static TimeSpan Measure(Action f)
{
    var stopWatch = new Stopwatch();
    stopWatch.Start();
    f();
    stopWatch.Stop();
    return stopWatch.Elapsed;
}

public static List<int> MakeList()
{
    var list = new List<int>();
    for (int i = 0; i < 100; ++i)
    {
        list.Add(i);
    }
    return list;
}

public static void Main()
{
    var time1 = Measure(() => { // Measure time of enumerating List<int>
        for (int i = 1000000; i > 0; i-- ) {
            foreach (var item in list)
            {
                var x = item;
            }
        }
    });
    Console.WriteLine($"List<int> time: {time1}");

    var time2 = Measure(() => { // IList<int>
        for (int i = 1000000; i > 0; i-- ) {
            foreach (var item in iList)
            {
                var x = item;
            }
        }
    });
    Console.WriteLine($"IList<int> time: {time2}");

    var time3 = Measure(() => { // ICollection<int>
        for (int i = 1000000; i > 0; i-- ) {
            foreach (var item in iCollection)
            {
                var x = item;
            }
        }
    });
    Console.WriteLine($"ICollection<int> time: {time3}");

    var time4 = Measure(() => { // IEnumerable<int>
        for (int i = 1000000; i > 0; i-- ) {
            foreach (var item in iEnumerable)
            {
                var x = item;
            }
        }
    });
    Console.WriteLine($"IEnumerable<int> time: {time4}");
}

Output:

List<int> time: 00:00:00.7976577
IList<int> time: 00:00:01.5599382
ICollection<int> time: 00:00:01.7323919
IEnumerable<int> time: 00:00:01.6075277

I've tried different order of the measures or making the MakeList() return one of the above interfaces but all of it only confirms that returning a List<int> and processing it as a List<int> is about twice as fast as with the interfaces.

However various sources, including this answer claim that you should never return List<> and always use an interface.

So my question is:

  • Why is processing a List<int> about twice as fast as the interfaces?
  • What should we return from a function and how to manage the code if we care about performance?
Rasmond
  • 442
  • 4
  • 15
  • Enumerating through an interface involves a lot more virtual method calls than enumerating through a `List` (which is a pattern the JIT can specifically optimize, even beyond eliminating the virtual dispatch). The more relevant question is this: do you find iterating over a million items is a thing you do often? When you do, do you find yourself using trivial loops that make the enumeration the bottleneck in the process? I'd wager the answer to both is "no". In the rare case where the answer is "yes", a simple `as` would allow code that needs it to get at the concrete implementation. – Jeroen Mostert Oct 16 '19 at 10:19
  • 1
    rule of thumb: let input parameters be as general as possible: `void Do(IEnumerable source)` is *better* than `void Do(List source)`; the returning value is quite opposite: the *more specific*, the better. `List Do()` is better than `IEnumerable Do()` – Dmitry Bychenko Oct 16 '19 at 10:20
  • 1
    ^ Which is called https://en.wikipedia.org/wiki/Robustness_principle. – GSerg Oct 16 '19 at 10:21
  • Except if you're not returning a *copy* of the data - suppose you have in the class a `List` which is never going to be modified, and you don't want to make a copy of the list every time you return it. Then you might want to return it as `IEnumerable` or perhaps `IReadOnlyCollection`. – Matthew Watson Oct 16 '19 at 10:28
  • 2
    @DmitryBychenko Regarding the return value, I would argue that it should not be too specific either, so that the implementation has more room to change (i.e. return a different concrete type), without breaking the callers. – Sweeper Oct 16 '19 at 10:43
  • 1
    @DmitryBychenko: I agree with Sweeper; your advice is backwards. The implementation has *preconditions*, so *accept* the type that is *the most general type that meets the preconditions*. The caller has *requirements*, so *return the most general type that meets the requirements*. That gives maximum flexibility to both the implementor of the caller *and* the callee to make choices. – Eric Lippert Oct 16 '19 at 13:08
  • 1
    @DmitryBychenko: Moreover, in this specific case there is a semantic difference between returning a list and returning a sequence. Returning a list says "please go ahead and modify this; I've given you a copy". Returning a sequence says "do not use this as anything other than a sequence". And if the caller needs a copy, they can make one with `ToList` easily enough. – Eric Lippert Oct 16 '19 at 13:10

1 Answers1

9

Why is processing a List<int> about twice as fast as the interfaces?

Great question. When attempting to foreach something, C# first checks to see whether the type of the collection already has a method called GetEnumerator that returns a type that has MoveNext and Current. If it does, it calls those directly. If not, then it falls back to using IEnumerable<T> or IEnumerable and IEnumerator<T> or IEnumerator to get the enumerator so that it can call MoveNext and Current.

This design choice was made for two reasons. First, in the world of C# 1.0 before generics, it meant that you could call a Current that returned int; IEnumerator.Current of course is object and so would box the int, which is both a speed and a memory penalty. Second, it meant that authors of collections could do experiments to figure out which implementation of MoveNext and Current had the best performance.

The implementors of List<T> did exactly that; if you examine GetEnumerator on List<T> you will discover something interesting: it returns a mutable value type. Yes, mutable value types are considered an easily-abused bad practice. But because 99.999% of the usages of this overload of GetEnumerator are called on your behalf by foreach, the vast majority of time you never even notice that there's a mutable value for you to abuse, and therefore do not abuse it.

(NOTE: The takeaway of the preceding paragraph should not be "use mutable value types because they are fast". The takeaway should be understand the usage patterns of your users and then design a safe, efficient tool that meets their needs. Usually a mutable value type is not the right tool.)

Anyways, long story short, we avoid all kinds of virtual calls, interface type checks, and so on, by binding directly to methods on mutable value types when iterating something known at compile time to be List<T>.

What should we return from a function and how to manage the code if we care about performance?

If you care about speed performance then you should be concentrating on the slowest thing in the program. Is the slowest thing in your program calling MoveNext on a collection? If so, congratulations, you have a very fast program; MoveNext is the next thing to optimize. But in this case really you should be asking "how do I avoid or delay this loop entirely?" if you're in that boat.

If MoveNext is not the slowest thing in the program then who cares if it is a few nanoseconds slower in a particular implementation? Return the type that is logically the one closest to what the caller wants and needs and don't worry about the tiny penalty.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • 1
    Great answer and you actually pointed that the time differences I was concerned about are in fact **really** small when the loop does more than just `var x = item` – Rasmond Oct 16 '19 at 14:53
  • "Return the type that is logically the one closest to what the caller wants and needs". The problem is that is not always easy to know or understand the optimal type for a caller, especially if you are developing a library that can be used by many different developers for many different purposes. However, even when you are the caller, in particular if the collection is used in different ways by a chain of multiple calls, it may happen that the caller's needs change during the time of development so you have to refactor the code to return a different and more useful type. – Luca Cremonesi Oct 16 '19 at 16:10
  • 1
    @LucaCremonesi: If the thesis of your comment is that we often don't know how callers are going to use the method or what they need it for, then you are, unfortunately, correct but that could be better phrased as "*we often start writing code before actually understanding the problem we're trying to solve*", and we should stop doing that. – Eric Lippert Oct 16 '19 at 16:50
  • Sometimes, the problem we solve today is different than the problem we have to solve tomorrow as requirements and priorities change, new or different requests from new or existing customers arise. That's the reason why the agile software development is being adopted more and more. You don't want to release a product that solves problems of a year ago; you want to release a software that solves today's problems. – Luca Cremonesi Oct 16 '19 at 18:07