9

Some programmers argue that it is better to pass IEnumerable<T> parameters over passing implementations like List<T>, and one of the popular reasons to do this is that the API is immediately usable in more scenarios because more collections will be compatible with IEnumerable<T> than any other specific implementation, e.g. List<T>.

The more I dive into multi-threaded development scenarios, the more I am starting to think that IEnumerable<T> is not the correct type either and I will try to explain why below.

Have you ever received the following exception while enumerating a collection?

Collection was modified; enumeration operation may not execute. (an InvalidOperationException)

Collection was modified exception

Basically what causes this is, you are given the collection on one thread, while someone else modifies the collection on another thread.

To circumvent the problem, I then developed a habit to take a snapshot of the collection before I enumerate it, by converting the collection to an array inside the method, like this:

static void LookAtCollection(IEnumerable<int> collection)
{   
    foreach (int Value in collection.ToArray() /* take a snapshot by converting to an array */)
    {
        Thread.Sleep(ITEM_DELAY_MS);
    }
}

My question is this. Wouldn't it be better to code towards arrays instead of enumerables as a general rule, even if it means that callers are now forced to convert their collections to arrays when using your API?

Is this not cleaner and more bullet-proof? (the parameter type is an array instead)

static void LookAtCollection(int[] collection)
{   
    foreach (int Value in collection)
    {
        Thread.Sleep(ITEM_DELAY_MS);
    }
}

The array meets all the requirements of being enumerable and of a fixed length, and the caller is aware of the fact that you will be operating on a snapshot of the collection at that point in time, which can save more bugs.

The only better alternative I can find right now is the IReadOnlyCollection which will then be even more bullet proof because the collection is then also readonly in terms of item-content.

EDIT:

@DanielPryden provided a link to a very nice article "Arrays considered somewhat harmful". And the comments made by the writer "I rarely need a collection which has the rather contradictory properties of being completely mutable, and at the same time, fixed in size" and "In almost every case, there is a better tool to use than an array." kind of convinced me that arrays are not as close to the silver bullet as I had hoped for, and I agree with the risks and loopholes. I think now the IReadOnlyCollection<T> interface is a better alternative than both the array and the IEnumerable<T>, but it kind of leaves me with the question now: Does the callee enforce it by having a IReadOnlyCollection<T> parameter type in the method declaration? Or should the caller still be allowed to decide what implementation of IEnumerable<T> he passes into the method that looks at the collection?

Thanks for all the answers. I learned a lot from these responses.

Martin Lottering
  • 1,624
  • 19
  • 31
  • 7
    No, same boat. You can screw around with the array _elements_ in the other thread too. Even the code you posted with `.ToArray` isn't threadsafe if the collection changes as you iterate to convert it to an array. Your best bet I think is to document (method signature, docs, etc) the fact that you're hitting a threaded method and leave the responsibility to the caller to provide you with a collection that can be worked with in a threaded fashion. Let _them_ do the copying (or provide you with a safe collection). If they give you garbage, you give them `InvalidOperationException` errors. – Chris Sinclair Feb 20 '13 at 12:13
  • 2
    [Lock vs. ToArray for thread safe foreach access of List collection](http://stackoverflow.com/questions/3128889/lock-vs-toarray-for-thread-safe-foreach-access-of-list-collection) – Tim Schmelter Feb 20 '13 at 12:13
  • 1
    I forgot to mention, you can make your method signatures _require_ some threadsafe collection or other locking mechanism, but that means you'll have to impose that mechanism on your callers. EDIT: You can also do your copy _before_ you spin off into your extra threads; then the copying would be synchronous to the calling code. The only possibility for threading issues there would be if the caller is doing threaded operations on the same collection as well.(I'm of the boat of keeping it simple and just leaving the responsibility to the caller, but that depends on your intended API/usage) – Chris Sinclair Feb 20 '13 at 12:17
  • As you're only passing `IEnumerable`, this effectively gives you an informal contract that the called code cannot modify the collection. This is extremely desirable in concurrent dev. You should take a look into immutability and why it's so important when going parallel. – spender Feb 20 '13 at 12:22
  • 2
    @MartinLottering If I had a nickel for every time a "smaller and faster iteration" _actually solved_ the threading issue... I'd be pretty poor. I find it only serves to hide it and have the application fail in very rare/random ways. Usually though as long as I do the copy _before_ spinning off into the threads (so it's synchronous with the caller) tends to be enough for most uses; best that each thread has a completely separate snapshot of data before it even starts. That didn't seem to be the case from your `LookAtCollection` example (looked more like it was _already_ in the separate thread) – Chris Sinclair Feb 20 '13 at 16:09
  • 1
    @MartinLottering: Not a direct answer to the thread safety issue, but have you read [Arrays considered somewhat harmful](http://blogs.msdn.com/b/ericlippert/archive/2008/09/22/arrays-considered-somewhat-harmful.aspx)? It's generally not a good idea to pass *variables* where what you want to transfer are *values*. – Daniel Pryden Feb 21 '13 at 08:37
  • @DanielPryden That was a great read, thank you. I appreciated the statements _"I rarely need a collection which has the rather contradictory properties of being completely mutable, and at the same time, fixed in size"_ and _"In almost every case, there is a better tool to use than an array."_ One thing that does stand out for me though, his original problem with arrays are focused more on return parameters, not input parameters, but the problems can be the same. Thanks again for the link. – Martin Lottering Feb 21 '13 at 10:37

5 Answers5

4

Most methods are not thread-safe unless explicitly documented otherwise.

In this case I'd say it's up to the caller to take care of thread-safety. If the caller passes an IEnumerable<T> parameter to a method, he should hardly be surprised if the method enumerates it! If the enumeration is not thread-safe, the caller needs to create a snapshot in a thread-safe manner, and pass in the snapshot.

The callee can't do this by calling ToArray() or ToList() - these methods will also enumerate the collection, and the callee can't know what is required (e.g. a lock) to make a snapshot in a thread-safe way.

Joe
  • 122,218
  • 32
  • 205
  • 338
  • I like your answer, but don't you think if you get a lot of InvalidOperationExceptions in the same method and each of those are cases where another thread is modifying the collection, then it is the callees problem to solve and not each of the callers' fault? Would it not be better if the method just took an array and avoid future errors too? And to apply the same principle to other collection-related methods? – Martin Lottering Feb 20 '13 at 14:51
  • 2
    @MartinLottering - no, I disagree that it would be better. A multi-threaded caller always needs to consider thread-safety and choose the most appropriate solution for his scenario. This might be taking a snapshot of the collection (ToArray), but in some cases could be another solution (e.g. holding a lock while the method is being executed). – Joe Feb 20 '13 at 14:56
  • Ok, but I agree with you that the caller should make the array. I just think the callee should enforce it too, by requiring a `T[]` instead of an `IEnumerable`. – Martin Lottering Feb 20 '13 at 15:23
  • 2
    Why enforce the additional overhead when the callee doesn't know if the caller needs it? – Joe Feb 20 '13 at 18:19
4

The real "problem" here is that while T[] is probably a more specific parameter type than would be ideal, and allows the recipient free access to write any element (which may or may not be desirable), IEnumerable<T> is too general. From a type-safety standpoint, the caller can supply a reference to any object which implements IEnumerable<T>, but in reality only certain such objects will actually work and there's no clean way for the caller to know which ones those are.

If T is a reference type, a T[] will be inherently thread-safe for reading, writing, and enumeration, subject to certain conditions (e.g. threads accessing different elements will not interfere at all; if one thread writes an item around the time another thread reads or enumerates it, the latter thread will see either old or new data. None of Microsoft's collection interfaces offer any such guarantees, nor do they provide any means by which a collection can indicate what guarantees it can or cannot make.

My inclination would be to either use T[], or else define an IThreadSafeList<T> where T:class which would include members like CompareExchangeItem that would allow Interlocked.CompareExchange on an item, but would not include things like Insert and Remove which cannot be done in thread-safe fashion.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • [Citation needed]! Writing a new reference to a variable of reference type happens to be an atomic operation on 32-bit, but I'm not sure that behavior is mandated by the spec. But even if it was, just updating the reference atomically is not sufficient: you also need a memory barrier to safely publish an object to another thread. – Daniel Pryden Feb 21 '13 at 08:28
  • @DanielPryden: The spec very clearly specifies that reference writes will not "split". As for memory barriers, the semantics are such that the location may be written any time between when the write is requested and the following memory barrier, and the value seen by software may be anything between the previous read barrier and when the read request takes place; in many cases, such semantics may be acceptable. If a control is showing a bunch of progress bars for various operations that are in progress, it's fine if the progress bars don't represent 100% up-to-date information, but... – supercat Feb 21 '13 at 15:44
  • ...the updating of status bars should not impede the progress of the threads that are doing real work, nor should an attempt to report the progress of the various tasks fail as a result of work being done. If a job gets added during enumeration, the status update will be perfectly happy if the new job is included, or if it isn't; likewise if a job finishes and gets removed from the collection. If the status view has a memory barrier before it starts enumeration, it won't care if it sees changes that occur curing enumeration. – supercat Feb 21 '13 at 15:46
  • My objection was mainly to the phrase "a `T[]` will be *inherently* thread-safe ... subject to certain conditions". An array is not any more inherently thread-safe than any other variable. Certainly, "subject to certain conditions" *any* data structure can be thread safe -- those conditions are what makes all the difference. Mainly, it sounds like what you're advocating is broken for the same reason that [double-checked locking is broken in Java](http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html). See also [this answer](http://stackoverflow.com/a/1964832/128397) – Daniel Pryden Feb 22 '13 at 00:55
  • 1
    @DanielPryden: The pattern I mentioned is used in a situation where it's desirable to update a UI control with the progress of a task, but it doesn't matter whether the UI control is absolutely up to date except in the scenario where a task is complete, and I would rather have the UI control lag behind the actual status of the task than have the task wait for the UI control to reflect its progress. That's a very common pattern in what I'm doing, and I would expect it to be common in other peoples' code too. – supercat Feb 22 '13 at 15:50
  • @DanielPryden: As for the thread-safety of arrays, their guarantees regarding threading behavior are stronger than those of other mutable Framework classes that implement `IList` with an indexed setter. One could write a class which implemented `IList` in a way which would offer even stronger guarantees, but no existing Framework classes do so. – supercat Feb 22 '13 at 16:36
2

This problem isn't specific to threading. Try modifying a collection within a loop that explicitly or implicitly calls GetEnumerator. The result is the same. It's an illegal operation.

Grant Thomas
  • 44,454
  • 10
  • 85
  • 129
  • True. It does pose the same problem, even on the same thread. I have had that happen to me too and requires careful planning on how to fix it. – Martin Lottering Feb 20 '13 at 14:55
1

Threads should never be allowed to access the same objects as other threads. You should always duplicate the variables/objects/arrays and parse them at the threads leisure.

When it's finished it should call back to the main thread so that thread can either update the current values/merge the values.

If two objects can reach the same object at the same time you get or a tug of war, or exceptions or strange behaviour.

Imagine threads like this: You have the boss and two employees. The boss makes copies of the letter he has in his hand and gives it to the employees. The employees each has his own tasks, one to collect everything on the list, the other to check if the items on the list need to be ordered from suppliers. They each go in their own private rooms and the boss just puts the paper on the desk where he sometimes checks it to tell someone how much there at this moment is.

employee one returns with the ordered objects and gives them to the director who passes it on to the logistics department. An hour later employee two returns with an updated list with stock quantities. The director throws out the old list and replaces it with the new list.

Now the scenario where there is one list.

The boss gives the same instructions. Employee one takes the list and starts calling the first supplier. Employee 2 takes the list and starts picking items. The director wants to look at the list but the list is gone and the entire sales funnel comes to a grinding halt because there is no accurate information because the list is occupied. The first employee has finished calling and wants to update his list. He can't because the list is occupied and he can't continue with the next item because the list is gone so he's stuck twiddling his thumbs or throwing a fit. When employee two is ready with his list employee takes the list and tries to finish calling order statusses but the director storms in and takes the list and passes on the information to the sales team. The director and the employee both start to fight over the list resulting in that no work is done and the company dies.

That is why you always need to work with independant copies in threads unless you know only that thread exclusively needs access to that resource, because you never know when that specific thread is going to do something with that resource because you don't control CPU timing.

I hope this helps a bit.

Tschallacka
  • 27,901
  • 14
  • 88
  • 133
  • I understand your analogy, but I don't think it is fair to say that threads should never access the same objects. That would reduce the usefullness of threads quite a bit. – Martin Lottering Feb 20 '13 at 14:57
  • That's the entire purpose of threads. They are independant programs. You need to give threads specific missions to execute and then report back when they are finished. They should not do critical modifications to objects that other program code also needs. A good idea for threading would be calculating background drops for a game, logging, intensive calculations, checking for updates etc... When they have run a round they can report that they have finished and pass the results to the main thread which does his magic with it. http://en.wikipedia.org/wiki/Deadlock – Tschallacka Feb 20 '13 at 15:07
  • But simple locking or other types of handshaking can easily allow multiple threads work on the same data, especially if it is just read-access. – Martin Lottering Feb 20 '13 at 15:16
  • 1
    I would still go against it. I've written multiple threaded applications and it simply gives a lot more reliability and simplicity in programming and program flow if you work with copied objects. Now if RAM is an issue you have a valid point and you have to complicate your code and program flow, then I would agree with you. – Tschallacka Feb 20 '13 at 15:20
  • 1
    One should only allow multiple threads to access the same object in cases where the semantics will be consistent with what one wants. There are many scenarios where it's perfectly fine to have one thread read a reference to an immutable object at the same time as another thread is performing writing to that reference, if the thread that's reading would be happy with either the old or new state. Further, copying data, modifying it, and writing it back poses semantic problems of its own. Sometimes it's appropriate, but taking snapshots all the time is hardly a panacea. – supercat Feb 20 '13 at 21:56
  • reading and writing at the same time poses a lot of problems because you get inconsistent data when on a multi core system because both threads could simultaneously be run on seperate cores instead of being in the queue waiting to be processed. With integers the problem won't be that great, but with large rasters or image manipulations it poses a bigger problem. My setup is usually "main console" for user input, "main thread" to compile everything passed back by child threads and keep it in store for the UI and under that seperate child threads that 'render' and pass back everything. – Tschallacka Feb 21 '13 at 07:47
0

One of the idea behind IEnumerable is that it is a result of a LINQ query. Since the query is rather than declaration instead of executed code, the result of IEnumerable is only invoked while iterating in foreach or copying items to an Array or other structure. Now if we would like to store the result in a static structure like an Array, we will have to update it prior to iteration as well as IEnumerable or we will receive out-of-date data. IEnumerable.ToArray means: Update the collection and make the static copy of it. So my answer is, that if we have to do update in advance of operating on the collection, the IEnumerable is more concise. In other scenarios we can use Array in order to simplify code.

Ryszard Dżegan
  • 24,366
  • 6
  • 38
  • 56
  • The `IEnumerable`/`IEnumerator` interface predate `LINQ` by many years. The essential idea behind them is that they allow code to give other code a sequence of individual items, without the recipient code having to know anything about how the items are derived; unfortunately, the interfaces don't provide *any* information, even things that would be useful to know (e.g. is there any circumstance under which the set of items enumerated by a collection might change, or does the collection inherently "know", or can it quickly find out, how many items are in it. – supercat Feb 21 '13 at 15:52
  • @supercat: You're right. Given the `IEnumerable` we only know, that we can iterate through the collection and nothing more. Implementator is even able to provide `IEnumerable` of infinite set of random items (e.g. `while (true) yield return random.Next()`). So if we work with suspicious and unknown source, we can't even assume, that the `IEnumerable` is finite or immutable. I consider here `IEnumerable` as iterator from which we want to obtain up-to-date data. In such case I would use `IEnumerable` instead of `Array`. – Ryszard Dżegan Feb 22 '13 at 08:17