17

I get when one would return an IEnumerable from a method—when there's value in deferred execution. And returning a List or IList should pretty much be only when the result is going to be modified, otherwise I'd return an IReadOnlyCollection, so the caller knows what he's getting isn't intended for modification (and this lets the method even reuse objects from other callers).

However, on the parameter input side, I'm a little less clear. I could take an IEnumerable, but what if I need to enumerate more than once?

The saying "Be conservative in what you send, be liberal in what you accept" suggests taking an IEnumerable is good, but I'm not really sure.

For example, if there are no elements in the following IEnumerable parameter, a significant amount of work can be saved in this method by checking .Any() first, which requires ToList() before that to avoid enumerating twice.

public IEnumerable<Data> RemoveHandledForDate(IEnumerable<Data> data, DateTime dateTime) {
   var dataList = data.ToList();

   if (!dataList.Any()) {
      return dataList;
   }

   var handledDataIds = new HashSet<int>(
      GetHandledDataForDate(dateTime) // Expensive database operation
         .Select(d => d.DataId)
   );

   return dataList.Where(d => !handledDataIds.Contains(d.DataId));
}

So I'm wondering what is the best signature, here? One possibility is IList<Data> data, but accepting a list suggests that you plan to modify it, which is not correct—this method doesn't touch the original list, so IReadOnlyCollection<Data> seems better.

But IReadOnlyCollection forces callers to do ToList().AsReadOnly() every time which gets a bit ugly, even with a custom extension method .AsReadOnlyCollection. And that's not being liberal in what is accepted.

What is best practice in this situation?

This method is not returning an IReadOnlyCollection because there may be value in the final Where using deferred execution as the whole list is not required to be enumerated. However, the Select is required to be enumerated because the cost of doing .Contains would be horrible without the HashSet.

I don't have a problem with calling ToList, it just occurred to me that if I need a List to avoid multiple enumeration, why do I not just ask for one in the parameter? So the question here is, if I don't want an IEnumerable in my method, should I really accept one in order to be liberal (and ToList it myself), or should I put the burden on the caller to ToList().AsReadOnly()?

Further Information for those unfamiliar with IEnumerables

The real problem here is not the cost of Any() vs. ToList(). I understand that enumerating the entire list costs more than doing Any(). However, assume the case that the caller will consume all items in the return IEnumerable from the above method, and assume that the source IEnumerable<Data> data parameter comes from the result of this method:

public IEnumerable<Data> GetVeryExpensiveDataForDate(DateTime dateTime) {
    // This query is very expensive no matter how many rows are returned.
    // It costs 5 seconds on each `.GetEnumerator` call to get 1 value or 1000
    return MyDataProvider.Where(d => d.DataDate == dateTime);
}

Now if you do this:

var myData = GetVeryExpensiveDataForDate(todayDate);
var unhandledData = RemoveHandledForDate(myData, todayDate);
foreach (var data in unhandledData) {
   messageBus.Dispatch(data); // fully enumerate
)

And if RemovedHandledForDate does Any and does Where, you'll incur the 5 second cost twice, instead of once. This is why you should always take extreme pains to avoid enumerating an IEnumerable more than once. Do not rely on your knowledge that in fact it's harmless, because some future hapless developer may call your method some day with a newly implemented IEnumerable you never thought of, which has different characteristics.

The contract for an IEnumerable says that you can enumerate it. It does NOT promise anything about the performance characteristics of doing so more than once.

In fact, some IEnumerables are volatile and won't return any data upon a subsequent enumeration! Switching to one would be a totally breaking change if combined with multiple enumeration (and a very hard to diagnose one if the multiple enumeration was added later).

Don't do multiple enumeration of an IEnumerable.

If you accept an IEnumerable parameter, you are in effect promising to enumerate it exactly 0 or 1 times.

ErikE
  • 48,881
  • 23
  • 151
  • 196
  • did you use the ToList() to not iterate twice on the data collection? – Bruno Joaquim Nov 10 '15 at 19:00
  • @BrunoJoaquim yes, that's why. – ErikE Nov 10 '15 at 19:01
  • did you dont want to iterate twice due the lower performance that this can cause? I did a test here and .ToList() is a lot slower than .Any(). – Bruno Joaquim Nov 10 '15 at 19:03
  • According to the stopwatch, .Any() requires about 119 ticks, in other hand .ToList() requires about 1228154 ticks, it is a huge difference. – Bruno Joaquim Nov 10 '15 at 19:05
  • 1
    @BrunoJoaquim The cost is not in the `ToList()` or the `Any()`. Please work on understanding `IEnumerables` better before answering—the cost is in an `IEnumerable` that has a big startup penalty (such as querying a database). If you begin enumerating twice, you could conceivably call the database twice! You're barking up the wrong tree, here. – ErikE Nov 10 '15 at 19:07
  • And why (in your mind) is calling `ToList` bad? Seems like if you want to avoid enumerating the raw query you need to turn it into a concrete data structure somehow. – D Stanley Nov 10 '15 at 19:12
  • 1
    I don't have a problem with calling `ToList`, it just then occurred to me, if I need a list, why do I not just ask for one in the parameter? So the question here is, if I don't want an `IEnumerable` in my method, should I accept one in order to be liberal and just `ToList` it myself? – ErikE Nov 10 '15 at 19:15
  • 1
    I always think that .ToList() is a heavy call, did it not allocate another array and copy all items to the new array? This isn't slower? – Bruno Joaquim Nov 10 '15 at 19:15
  • 1
    @BrunoJoaquim In some cases it may be slower - but in some cases it may improve performance. – D Stanley Nov 10 '15 at 19:16
  • @DStanley Could you provide an example that it improve performance? – Bruno Joaquim Nov 10 '15 at 19:20
  • 3
    The OP's example is a good one - if the underlying object is an expensive query, it will be faster to call `ToList()` then two linq methods than executing the query twice. That's the point of the question. – D Stanley Nov 10 '15 at 19:28

5 Answers5

7

IReadOnlyCollection<T> adds to IEnumerable<T> a Count property and the corresonding promise that there is no deferred execution. It would be the appropriate parameter to ask for, if the parameter is where you want to tackle this problem.

However, I suggest asking for IEnumerable<T>, and calling ToList() in the implementation itself instead.

Observation: Both approaches have the drawback that the multiple enumeration may at some point be refactored away, rendering the parameter change or ToList() call redundant, which we may overlook. I do not think this can be avoided.

The case does speak for calling ToList() in the method body: Since the multiple enumeration is an implementation detail, avoiding it should be an implementation detail as well. This way, we avoid affecting the API. We also avoid changing back the API if the multiple enumeration ever gets refactored away. We also avoid propagating the requirement through a chain of methods, all of which would have to ask for an IReadOnlyCollection<T> just because of our multiple enumeration.

If you are concerned with the overhead of creating extra lists (when the output already is a list or so), Resharper suggests the following approach:

param = param as IList<SomeType> ?? param.ToList();

Of course, we can do even better, because we only need to protect against deferred execution - no need for a full-blown IList<T>:

param = param as IReadOnlyCollection<SomeType> ?? param.ToList();
Timo
  • 7,992
  • 4
  • 49
  • 67
4

There are definitely ways around that will let you accept IEnumerable<T>, only enumerate once and make sure you don't query the database multiple times. Solutions I can think of:

  • instead of using Any and Where you could use the enumerator directly. Call MoveNext instead of Any to see if there are any items in the collection, and manually iterate further in after making your database query.
  • use Lazy to initialize your HashSet.

The first one seems ugly, the second one might actually make a lot of sense:

public IEnumerable<Data> RemoveHandledForDate(IEnumerable<Data> data, DateTime dateTime)
{
    var ids = new Lazy<HashSet<int>>(
        () => new HashSet<int>(
       GetHandledDataForDate(dateTime) // Expensive database operation
          .Select(d => d.DataId)
    ));

    return data.Where(d => !ids.Value.Contains(d.DataId));
}
ErikE
  • 48,881
  • 23
  • 151
  • 196
MarcinJuraszek
  • 124,003
  • 15
  • 196
  • 263
  • It will not cause `GetHandleDataForDate` to be evaluated if `data` has no items -> `Where` predicate will not be called in that case. – MarcinJuraszek Nov 10 '15 at 19:10
  • Oh, I get it, you're right! Thanks for setting me straight. Deleting my comment. – ErikE Nov 10 '15 at 19:12
  • Upon review I noticed that the Lazy has a different undesirable characteristic where the input enumerable is not iterated until the consumer iterates the output enumerable. That could be unexpected and undesirable. – ErikE Jan 05 '17 at 16:19
  • 2
    That _is_ expected and desirable because you accept IEnumerable as an argument and return IEnumerable - IEnumerable's supposed to be lazy iterated! – pkuderov Apr 21 '17 at 17:01
3

You can take an IEnumerable<T> in the method, and use a CachedEnumerable similar to the one here to wrap it.

This class wraps an IEnumerable<T> and makes sure that it is only enumerated once. If you try to enumerate it again, it yield items from the cache.

Please note that such wrapper does not read all items from the wrapped enumerable immediately. It only enumerates individual items from the wrapped enumerable as you enumerate individual items from the wrapper, and it caches the individual items along the way.

This means that if you call Any on the wrapper, only a single item will be enumerated from the wrapped enumerable, and then such item will be cached.

If you then use the enumerable again, it will first yield the first item from the cache, and then continue enumerating the original enumerator from where it left.

You can do something like this to use it:

public IEnumerable<Data> RemoveHandledForDate(IEnumerable<Data> data, DateTime dateTime)
{
    var dataWrapper = new CachedEnumerable(data);
    ...
}

Notice here that the method itself is wrapping the parameter data. This way, you don't force consumers of your method to do anything.

Yacoub Massad
  • 27,509
  • 2
  • 36
  • 62
  • 1
    It should be noted that `CachedEnumerable` still creates a list behind-the-scenes, it just does it while the collection is enumerated. So the net effect is the same as calling `ToList` _then_ enumerating the list. – D Stanley Nov 10 '15 at 19:14
  • @DStanley, it is different. If you only consume 5 items, and the original enumerable contains 100 items, then using `CachedEnumerable` will only read 5 items from the original enumerable. If you use `ToList`, then you will read the whole 100 items when you needed only 5. – Yacoub Massad Nov 10 '15 at 19:16
  • Well in that case you should do `Take(5).ToList()` which _would_ have the same effect. – D Stanley Nov 10 '15 at 19:27
  • 1
    @DStanley, the difference in this case is that using `CachedEnumerable` allows you to use it multiple times naturally as `IEnumerable` (e.g. use Any, and then use Take(10)...) while knowing that internally, the original enumerable will be enumerated only once. What happens if you decide to take 10 items after you take 5 items without `CachedEnumerable`? The original enumerable will be enumerated twice. – Yacoub Massad Nov 10 '15 at 19:35
  • 6 years later, I'm selecting your answer. Your answer is the only one that allows `.Any()` to be used, without having to `.ToList()` the entire thing, but also not pay the possible performance or even lack-of-correctness penalties possible with multiple enumeration. Sorry I didn't accept your answer sooner! – ErikE Nov 18 '21 at 06:35
1

I don't think this can be solved just by changing the input types. If you want to allows more general structures than List<T> or IList<T> then you have to decide if/how to handle these possible edge cases.

Either plan for the worst case and spend a little time/memory creating a concrete data structure, or plan for the best case and risk the occasional query getting executed twice.

You might consider documenting that the method enumerates the collection multiple times so that the caller can decide if they want to pass in an "expensive" query, or hydrate the query before calling the method.

D Stanley
  • 149,601
  • 11
  • 178
  • 240
  • 1
    I don't think that documenting that the method enumerates the collection multiple times is a good solution. `IEnumerable` should never be enumerated twice. This is part of the contract of using that interface. [bad form to enumerate more than once](http://www.codethinked.com/done28099t-get-bit-by-lazy-evaluation), [exponential slowdown by enumerating twice](http://twistedoakstudios.com/blog/Post7694_achieving-exponential-slowdown-by-enumerating-twice), [avoid multiple enumeration of IEnumerable](https://fromjami.wordpress.com/2013/12/08/avoid-multiple-enumeration-of-ienumerable/) and so on. – ErikE Nov 10 '15 at 19:32
  • @ErikE "Never" is a bit strong - enumeratng a `List` twice is harmless. I agree that it _can_ cause issues if done unnecessarily, but sometimes _it is necessary_. That's why it's a _warning_ and not an _error_. If you _have_ to iterate multiple times then you need to decide which is worse - creating a list or risking multiple expensive operations. Since you can't know just form the input types that's why I suggested letting the _caller_ decide. – D Stanley Nov 10 '15 at 19:36
  • 1
    I see this as a "pit of success" vs. "pit of failure" issue. If you enumerate multiple times, eventually somewhere in your system someone is going to make a mistake and cause a problem. I prefer to write code that doesn't break even when callers do things that I wasn't expecting, and prefer not to make assumptions that callers can violate in a harmful way. It's just like inlining user strings into SQL code--it's a **very bad idea**, even if you *know for a fact* that your application currently works okay. Some day, someone will use your method and pass in untrustworthy user input. Whoops. – ErikE Nov 10 '15 at 19:44
  • So then plan for the worst and create a list - I don't see an easy way to have it both ways. Life insurance isn't free. You have to pay for it. – D Stanley Nov 10 '15 at 19:45
  • Right! I agree with you! So this question is really about the parameter data type, not (necessarily) about whether to `ToList` or not. Do you have thoughts for me on that aspect? – ErikE Nov 10 '15 at 19:47
  • Like I said - I don't think changing the parameter types will solve the issue. There's not an interface type that defines an "`IEnumerable` that is safe to enumerate multiple times". If you don't want to enumerate the underlying object multiple times then _don't do it_. – D Stanley Nov 10 '15 at 19:54
  • I believe `IReadOnlyCollection` adds exactly two things, namely a `Count` property and the (corresponding!) promise that there is no deferred execution. In effect, it is "an interface type that defines an `IEnumerable` that is safe to enumerate multiple times". So, that leaves us with the choice between demanding an `IReadOnlyCollection` (meaning client code using LINQ may have to call `ToList()` to satisfy our method) vs. calling `ToList()` in the method itself. For our API, the latter is cleaner. For readability of the method body, the former is cleaner, avoiding `param = param.ToList()`. – Timo Jan 05 '17 at 09:36
  • I just considered chaining. `MethodA` demands an `IEnumerable` and addresses it *once*, passing it to `MethodB`. `MethodB` will perform multiple enumeration. If `MethodB` demands an `IReadOnlyCollection`, it will pollute `MethodA`, which will have to either change its own parameter type to `IReadOnlyCollection` or call `ToList()`. I suppose `MethodB` should keep the responsibility localized, and call `ToList()` itself. Admittedly, needless extra lists may be created. – Timo Jan 05 '17 at 09:41
1

I would argue that IEnumerable<T> is a good option for an argument type. It is a simple, generic and easy to provide structure. There is nothing inherent about the IEnumerable contract that implies that one should only ever iterate it once.

In general, the performance cost for testing .Any() probably isn't high but, of course, cannot be guaranteed to be so. In the circumstances you describe, it could obviously be the case that iterating the first element has considerable overhead but this is by no means universal.

Changing the parameter type to something like IReadOnlyCollection<T> or IReadOnlyList<T> is an option but probably only a good one in the circumstance that some or all of the properties/methods provided by that interface are required.

If you don't need that functionality and instead want to guarantee that your method only iterates the IEnumerable once, you can do so by calling .ToList() or by turning it into some other appropriate type of collection but that is an implementation detail of the method itself. If the contract that you are designing requires "something which can be iterated" then IEnumerable<T> is a very appropriate choice.

Your method has the power to make guarantees about how many times any collection will be iterated, you don't need to expose that detail beyond the boundaries of your method.

By contrast, if you do choose to repeatedly enumerate an IEnumerable<T> inside your method then you must also take into consideration every eventuality which could be a result of that choice, for instance potentially getting different results in different circumstances due to deferred execution.

That said, as a point of best practise, I think it makes sense to try to avoid any side-effects in IEnumerables returned by your own code to the maximum extent possible - languages like Haskell can make use of lazy evaluation throughout safely because they go to great pains to avoid side effects. If nothing else, people who consume your code might not be as dilligent as you in guarding against multiple enumeration.

TheInnerLight
  • 12,034
  • 1
  • 29
  • 52
  • I respectfully disagree that there is nothing inherent in the `IEnumerable` contract that implies single enumeration. Perhaps it is not truly contract level, but there is a very strong implication. Writers of `IEnumerable` and `IQueryable`-returning methods need not protect users from the consequences of multiple enumeration, and are free to implement such that multiple enumerations are harmful. Basically, if you enumerate an `IEnumerable` twice, you are absolutely accepting any cost in doing so, and cannot rely on that cost not changing in the future. Bad code will almost invariably result. – ErikE Nov 10 '15 at 22:29
  • If you do want multiple enumeration, such as in a case where you actually want the underlying provider to re-fetch the data (perhaps you're polling a database and expecting changes to occur during the method's execution), then feel free to enumerate multiple times! – ErikE Nov 10 '15 at 22:30
  • @ErikE this is something of a problem when you have lazy evaluation combined with side effects. If you choose to use a raw `IEnumerable` from an external source and enumerate it multiple times then absolutely you have to guard against the possible consequences of that decision. Personally though, as a matter of style, I would always try to ensure the `IEnumerable`s returned by my functions are side-effect free so they can be enumerated repeatedly safely. Since LINQ provides functional-style features, adopting functional idioms when using it make sense to me. – TheInnerLight Nov 10 '15 at 23:37
  • I agree with you that inasmuch as it is possible, the `IEnumerable`-returning functions I write should be as side-effect free as possible. However, some `IEnumerable`s (and `IQueryable`s for that matter) just don't lend themselves to this when you are looking to perform composition on them. – ErikE Nov 10 '15 at 23:56