4

I was reading through a question asking Is it better to call ToList() or ToArray() in LINQ queries? and found myself wondering why Enumerable.ToArray() wouldn't first just call the Count() method to find the size of the collection instead of using the internal Buffer{T} class which dynamically resizes itself. Something like the following:

T[] ToArray<T>(IEnumerable<T> source)
{
    var count = source.Count();
    var array = new T[count];

    int index = 0;
    foreach (var item in source) array[index++] = item;
    return array;
}

I know that we can't understand what is going through the minds of the designers and implementers and I'm sure they're much smarter than myself. So the best way to ask this question is what's wrong with the approach shown above? It seems to be less memory allocation and still operates in O(n) time.

Community
  • 1
  • 1
Anthony
  • 9,451
  • 9
  • 45
  • 72
  • 4
    Because calling `Count()` first would require enumerating the sequence twice. – Richard Deeming Jun 18 '13 at 15:43
  • Because by calling Count() and then iterating over the sequence again to get the elements would execute potential side effects twice. – Nikon the Third Jun 18 '13 at 15:45
  • @RichardDeeming not necessarily though. If the type also implements `ICollection` it is optimized to use the `Count` property directly. It still makes sense to avoid `Count()` here. – Brian Rasmussen Jun 18 '13 at 15:46
  • @BrianRasmussen Took the words out of my mouth. – Anthony Jun 18 '13 at 15:46
  • @BrianRasmussen: That's a big "if". The `ToArray` method can't assume that the input sequence will implement anything other that `IEnumerable`. – Richard Deeming Jun 18 '13 at 15:47
  • @RichardDeeming agree. There's no guarantee that it will implement `ICollection` so the safe bet is to avoid calling `Count()`. – Brian Rasmussen Jun 18 '13 at 15:48
  • @RichardDeeming If that were the case, then the Count() method wouldn't make the very same optimization. – Anthony Jun 18 '13 at 15:48
  • 1
    @BrianRasmussen: Plus, if the input sequence *does* implement `ICollection`, the `Buffer` class uses the `Count` property to allocate an array of the correct size. – Richard Deeming Jun 18 '13 at 15:49

4 Answers4

4

The Buffer<T> class has an optimization for the case where the source sequence implements ICollection<T>:

internal Buffer(IEnumerable<TElement> source)
{
   int length = 0;
   TElement[] array = null;
   ICollection<TElement> collection = source as ICollection<TElement>;
   if (collection != null)
   {
      length = collection.Count;
      if (length > 0)
      {
         array = new TElement[length];
         collection.CopyTo(array, 0);
      }
   }
   else
   {
      ...

If the sequence doesn't implement ICollection<T>, the code cannot assume that it's safe to enumerate the sequence twice, so it falls back to resizing the array as required.

Richard Deeming
  • 29,830
  • 10
  • 79
  • 151
4

First, the Buffer<T> class constructor also optimizes if the specified sequence can be casted to ICollection(like array or list) which has a Count property:

TElement[] array = null;
int num = 0;
ICollection<TElement> collection = source as ICollection<TElement>;
if (collection != null)
{
    num = collection.Count;
    if (num > 0)
    {
        array = new TElement[num];
        collection.CopyTo(array, 0);
    }
}
else
    // now we are going the long way ...

So if it's not a collection the query must be executed to get the total count. But using Enumerable.Count just to initialize the array correctly sized can be very expensive and - more important - could have dangerous side-effects. Hence it is unsafe.

Consider this simple File.ReadLines example:

var lines = File.ReadLines(path);
int count = lines.Count(); // executes the query which also disposes the underlying IO.TextReader 
var array = new string[count];
int index = 0;
foreach (string line in lines) array[index++] = line;

This will throw an ObjectDisposedException "Cannot read from a closed TextReader" since lines.Count() already executed the query and in the meantime the reader is disposed at foreach.

Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • Another example would be `BlockingCollection.GetConsumingEnumerable()`. Calling `Count()` would consume the items in the collection and the iteration would then return nothing. – svick Jun 18 '13 at 16:26
  • @TimSchmelter I would have voted for your answer [here](http://stackoverflow.com/a/17180637/57611), it being more complete--don't delete next time! – ErikE Jun 18 '13 at 23:59
1

Because Count() enumerates the source to the end. So it will at minimum do 2 iterations, one just for the count and the other to copy the items.

Now consider that the enumerable in question is a database cursor or something else similar that involves non trivial operations when iterating. That would be a performance killer.

It's a way better idea to just memcopy and extend the buffer. It might have a slight performance impact but it's very small and more important it's a known quantity.

Adrian Zanescu
  • 7,907
  • 6
  • 35
  • 53
  • BTW, you can't “memcopy” out of `IEnumerable`, you have to do it item by item. – svick Jun 18 '13 at 16:22
  • i was referring to memcopying the buffer array when extending it. Of course that you need to get every item one by one from the enumerable – Adrian Zanescu Jun 20 '13 at 08:40
0

If IEnumerable<T> and/or IEnumerator<T> had included a property to ask whether it "knew" its count, along with a Count property, it might have been worthwhile for ToArray() to make use of such a thing [having the Count be part of IEnumerator<T> would be helpful in cases where e.g. calling GetEnumerator on a a thread-safe mutable type will enumerate a snapshot]. Absent such ability, even if code has an ICollection or ICollection<T>, there's no way for it to know whether calling Count will take more or less time than creating extra temporary arrays.

That having been said, I would expect that the optimal implementation for something like ToArray would probably be to use a linked list of things, each of which holds some number of items, so that each item would be read into the space it would occupy until such time as it can be copied to the final array. The doubling strategy of List<T> doesn't seem particularly appropriate here, since it would be better to have information divided among multiple small arrays than to create an array of over 85,000 bytes (since the temporary arrays will be useless after it exits, having them end up on the Large Object Heap would be particularly bad).

supercat
  • 77,689
  • 9
  • 166
  • 211
  • “having the `Count` be part of `IEnumerator` would be helpful” Isn't that exactly what `IReadOnlyCollection` (and the less general `ICollection`) is for? – svick Jun 18 '13 at 16:27
  • @svick It also would remove the ability to have an infinite enumerable collection removing the stream-ability so-to-speak of an enumerable. – Anthony Jun 18 '13 at 16:56
  • @svick: Calling an interface member is generally faster than trying to cast to another interface type, whether or not the cast succeeds. Further, if type `Bunch` happens to implement `IList` but not the non-generic `ICollection`, there's no nice way for code which expects an `IEnumerable` to quickly find out how many things are in a `Bunch`. Also, while a thread-safe collection can offer snapshot enumeration semantics, the value returned by `ICollection.Count` may have no relation to the number of items returned by a preceding or following `GetEnumerator`. – supercat Jun 18 '13 at 16:57
  • @Anthony: If `IEnumerable` had a property to say whether it knew its count, such a property could return a "flags" `enum` to indicate whether the count was known, whether it could be over two billion, whether it could be infinite, whether it was known to be infinite, whether it should be expected to never change, whether it should be expected not to change between a `Count` and an complete enumeration if the thread performing such actions doesn't explicitly mutate the underlying collection, etc. – supercat Jun 18 '13 at 17:03
  • @supercat 1. If a type doesn't implement `ICollection` or `IReadOnlyCollection`, then it most likely also wouldn't return count from `IEnumerator`, even if it could. 2. If you want to have count-consistent snapshot, don't implement `IEnumerable`, but add a method like `IReadOnlyList Snaphost()`. 3. I think you're overcomplicating this (I'm especially worried about that “etc.”). You're creating an unreliable (it can always say “I don't know”) monster that will be useful in tiny minority of cases. Right now, `IEnumerator` is simple and easy to implement and I like it that way. – svick Jun 18 '13 at 17:53
  • @svick: Since `IList` and `ICollection` don't inherit from `IReadOnlyList` nor `ICollection`, and since prior to the introduction of covariant interfaces there was no advantage to implementing `ICollection`, there are a lot of implementations of those types which have a `Count` method but don't implement the aforementioned interfaces. As for the `etc`., having a method which returns a single flags enum doesn't seem overly complicated. Define the semantics so that "0" will be a "safe" value for an enumerator that doesn't know how to do anything special and go from there. – supercat Jun 18 '13 at 18:03
  • @svick: As for suggesting that a collection implement some "other" method for taking a snap-shot, it's normal for code that wants a snapshot of an `IEnumerable` to do something like `ToArray()` or `ToList()`. It's possible for a threadsafe collection to implement `IEnumerable` in such fashion that those methods work to take a snapshot; why require the use of some other method? – supercat Jun 18 '13 at 23:53