43

Before starting a project, I wrote a simple test to compare the performance of ConcurrentBag (from System.Collections.Concurrent) against locking a plain List. I was extremely surprised to find that ConcurrentBag is over 10 times slower than locking with a simple List. From what I understand, ConcurrentBag works best when the reader and the writer are the same thread. However, I hadn't thought its performance would be so much worse than traditional locks.

I have run a test with two Parallel for loops writing to and reading from a list/bag. However, the write by itself shows a huge difference:

private static void ConcurrentBagTest()
{
    int collSize = 10000000;
    Stopwatch stopWatch = new Stopwatch();
    ConcurrentBag<int> bag1 = new ConcurrentBag<int>();

    stopWatch.Start();

    Parallel.For(0, collSize, delegate(int i)
    {
        bag1.Add(i);
    });

    stopWatch.Stop();
    Console.WriteLine("Elapsed Time = {0}",
                      stopWatch.Elapsed.TotalSeconds);
}

On my box, this takes 3-4 seconds to run, compared to 0.5-0.9 seconds for this code:

private static void LockCollTest()
{
    int collSize = 10000000;
    object list1_lock = new object();
    List<int> lst1 = new List<int>(collSize);

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    Parallel.For(0, collSize, delegate(int i)
    {
        lock (list1_lock)
        {
            lst1.Add(i);
        }
    });

    stopWatch.Stop();
    Console.WriteLine("Elapsed = {0}",
                      stopWatch.Elapsed.TotalSeconds);
}

As I mentioned, doing concurrent reads and writes doesn't help the concurrent bag test. Am I doing something wrong or is this data structure just really slow?

[EDIT] - I removed the Tasks because I don't need them here (Full code had another task reading)

[EDIT] Thanks a lot for the answers. I am having a hard time picking "the right answer" since it seems to be a mix of a few answers.

As Michael Goldshteyn pointed out, the speed really depends on the data. Darin pointed out that there should be more contention for ConcurrentBag to be faster, and that Parallel.For doesn't necessarily start the same number of threads. One point to take away is to not do anything you don't have to inside a lock. In the above case, I don't see myself doing anything inside the lock except maybe assigning the value to a temporary variable.

Additionally, sixlettervariables pointed out that the number of threads that happen to be running may also affect results, although I tried running the original test in reverse order and ConcurrentBag was still slower.

I ran some tests starting 15 Tasks, and the results depended on the collection size among other things. However, ConcurrentBag performed almost as well as, or better than, locking a list for up to 1 million insertions. Above 1 million, locking sometimes seemed much faster, but I'll probably never have a larger data structure for my project. Here's the code I ran:

int collSize = 1000000;
object list1_lock = new object();
List<int> lst1 = new List<int>();
ConcurrentBag<int> concBag = new ConcurrentBag<int>();
int numTasks = 15;

Stopwatch sWatch = new Stopwatch();
sWatch.Start();

// First, try locks
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        // loop counter must be local to each task; sharing one
        // counter across tasks races and skews the iteration count
        for (int i = 0; i < collSize / numTasks; i++)
        {
            lock (list1_lock)
            {
                lst1.Add(x);
            }
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("lock test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);

// Now try ConcurrentBag
sWatch.Restart();
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++)
        {
            concBag.Add(x);
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("Conc Bag test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);
xpda
Tachy
    You really don't need to wrap the `Parallel.For()` in a `Task inserter` in this benchmark. – H H Jan 24 '11 at 18:35
  • @Henk, you're correct. I wrapped it in a task because in the full code there is another task that reads while this one writes. @Rauhotz, I have a dual-core box – Tachy Jan 24 '11 at 18:47

11 Answers

43

Let me ask you this: how realistic is it that you'd have an application which is constantly adding to a collection and never reading from it? What's the use of such a collection? (This is not a purely rhetorical question. I could imagine there being uses where, e.g., you only read from the collection on shutdown (for logging) or when requested by the user. I believe these scenarios are fairly rare, though.)

This is what your code is simulating. Calling List<T>.Add is going to be lightning-fast in all but the occasional case where the list has to resize its internal array, and those resizes are smoothed out by all the other adds that happen quite quickly. So you're not likely to see a significant amount of contention in this context, especially testing on a personal PC with, e.g., 8 cores (as you stated you have in a comment somewhere). You might see more contention on something like a 24-core machine, where many cores can be trying to add to the list literally at the same time.

Contention is much more likely to creep in where you read from your collection, especially in foreach loops (or LINQ queries, which amount to foreach loops under the hood), which require locking the entire operation so that you aren't modifying the collection while iterating over it.
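To make that concrete, here is a minimal sketch (my own, not the answer's pastebin program; all names are invented) of why reads push contention onto a locked List<T> but not onto ConcurrentBag<T>, whose enumerator iterates a moment-in-time snapshot:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ReadContentionSketch
{
    static void Main()
    {
        var list = new List<int>();
        var listLock = new object();
        var bag = new ConcurrentBag<int>();
        var stop = false;

        // A writer keeps adding while the main thread reads.
        var writer = Task.Run(() =>
        {
            for (int i = 0; !Volatile.Read(ref stop); i++)
            {
                lock (listLock) { list.Add(i); }
                bag.Add(i);
            }
        });

        Thread.Sleep(50); // let some items accumulate

        // Reading the List<T> means holding the lock for the whole
        // enumeration, so the reader and writer serialize on listLock.
        long listCount = 0;
        lock (listLock)
        {
            foreach (int x in list) listCount++;
        }

        // ConcurrentBag<T>'s enumerator works over a snapshot, so no
        // user-level lock is needed and the writer is never blocked.
        long bagCount = 0;
        foreach (int x in bag) bagCount++;

        Volatile.Write(ref stop, true);
        writer.Wait();
        Console.WriteLine("read {0} from list, {1} from bag", listCount, bagCount);
    }
}
```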

If you can realistically reproduce this scenario, I believe you will see ConcurrentBag<T> scale much better than your current test is showing.


Update: Here is a program I wrote to compare these collections in the scenario I described above (multiple writers, many readers). Running 25 trials with a collection size of 10000 and 8 reader threads, I got the following results:

Took 529.0095 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 39.5237 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 309.4475 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 81.1967 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 228.7669 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 164.8376 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
[ ... ]
Average list time: 176.072456 ms.
Average bag time: 59.603656 ms.

So clearly it depends on exactly what you're doing with these collections.

Dan Tao
  • I do read from it but as I mentioned in the original post, the amount of time it took was proportionate. However, it may not have been a valid test because of how Parallel.For creates new threads differently for the two cases – Tachy Jan 24 '11 at 21:40
  • @TriArc: Sorry, I was going from the code you posted rather than the words you wrote (I do that a lot). I now see that you *say* you tested concurrent reads/writes but without seeing the code it's hard to say what's going on there. Did you test with 1 reader and 1 writer? All I can tell you is that my understanding of the collections in `System.Collections.Concurrent` is that they're designed to be **scalable**, so the best tests to measure their benefits will be ones that involve large numbers of readers and/or writers. – Dan Tao Jan 24 '11 at 21:43
  • @TriArc: But anyway, I would recommend at least taking a look at the program I posted on pastebin. Maybe comparing that to what you're doing and seeing what makes our tests different will shed some light on the subject. – Dan Tao Jan 24 '11 at 21:44
    Heh, just found this while researching ConcurrentBag for a problem, where, interestingly enough, I never read from the collection (until after the threads finish writing to it and re-join the parent). The particular case involves partitioning data into multiple sets -- so, perhaps not as rare as you might think ;-) – Cameron Dec 10 '11 at 04:51
15

There seems to be a bug in the .NET Framework 4 that Microsoft fixed in 4.5; it seems they didn't expect ConcurrentBag to be used this heavily.

See the following Ayende post for more info

http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0

Paleta
10

As a general answer:

  • Concurrent collections that use locking can be very fast if there is little or no contention for their data (i.e., for the locks). This is because such collection classes are often built using very inexpensive locking primitives, which are especially cheap when uncontended.
  • Lock-free collections can be slower because of the tricks used to avoid locks and because of other bottlenecks such as false sharing and the complexity required to implement their lock-free nature, which can lead to cache misses.

To summarize, which approach is faster depends highly on the data structures employed and on the amount of contention for the locks, among other issues (e.g., the number of readers vs. writers in a shared/exclusive arrangement).

Your particular example has a very high degree of contention, so I must say I am surprised by the behavior. On the other hand, the amount of work done while the lock is held is very small, so maybe there is little contention for the lock itself after all. There could also be deficiencies in the implementation of ConcurrentBag's concurrency handling which make your particular example (with frequent inserts and no reads) a bad use case for it.
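A rough illustration of the first bullet (a sketch I added, not the answerer's code): on a single thread the lock is never contended, so Monitor's cheap fast path dominates and the locked List<T> is hard to beat, while ConcurrentBag still pays for its thread-local bookkeeping:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

class UncontendedLockSketch
{
    const int N = 1000000;

    static void Main()
    {
        var list = new List<int>(N);
        var gate = new object();
        var bag = new ConcurrentBag<int>();

        // One thread only: Monitor.Enter always takes its uncontended
        // fast path, so the cost per iteration is essentially List.Add.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
        {
            lock (gate) { list.Add(i); }
        }
        Console.WriteLine("uncontended lock + List: {0} ms", sw.ElapsedMilliseconds);

        // ConcurrentBag does its per-thread-list housekeeping even
        // with zero contention.
        sw.Restart();
        for (int i = 0; i < N; i++) bag.Add(i);
        Console.WriteLine("single-threaded ConcurrentBag: {0} ms", sw.ElapsedMilliseconds);
    }
}
```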

Michael Goldshteyn
9

Looking at the program with Microsoft's Concurrency Visualizer shows that ConcurrentBag<T> has a much higher cost associated with parallel insertion than simply locking on a List<T>. One thing I noticed is that there appears to be a cost to spinning up the 6 threads (used on my machine) for the first ConcurrentBag<T> run (a cold run). 5 or 6 threads are then reused for the List<T> code, which is faster (a warm run). Adding another ConcurrentBag<T> run after the list shows it takes less time than the first (another warm run).

From what I'm seeing in the contention, a lot of time is spent in the ConcurrentBag<T> implementation allocating memory. Removing the explicit allocation of size from the List<T> code slows it down, but not enough to make a difference.

EDIT: it appears that ConcurrentBag<T> internally keeps a list per Thread.CurrentThread, locks 2-4 times depending on whether it is running on a new thread, and performs at least one Interlocked.Exchange. As noted in MSDN, it is "optimized for scenarios where the same thread will be both producing and consuming data stored in the bag." This is the most likely explanation for your performance decrease versus a raw list.
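The MSDN wording suggests the pattern below (my sketch, not part of the answer): each worker both produces and consumes, so TryTake can usually pop from the calling thread's own internal list without stealing from another thread's:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class SameThreadProducerConsumerSketch
{
    static void Main()
    {
        var bag = new ConcurrentBag<int>();

        Parallel.For(0, 4, worker =>
        {
            // Produce on this thread...
            for (int i = 0; i < 1000; i++) bag.Add(i);

            // ...and consume on the same thread. This is the case the
            // per-thread internal lists are optimized for: TryTake
            // normally hits the caller's own list and avoids stealing.
            int item, taken = 0;
            while (bag.TryTake(out item)) taken++;
            Console.WriteLine("worker {0} drained {1} items", worker, taken);
        });

        Console.WriteLine("left in bag: {0}", bag.Count);
    }
}
```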

user7116
5

This is already resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have many instances. That has been fixed, and ConcurrentBag can now run fairly fast.

source - The HIGH cost of ConcurrentBag in .NET 4.0

Rohit
  • Not me, but maybe because it's a duplicate answer? – nawfal Apr 03 '13 at 11:13
    That article is terrible; the author is testing creating large numbers of ConcurrentBags and putting them in a list of objects, not actually testing making 1 bag and putting tons of objects into the bag itself. – bcr Jul 11 '14 at 15:06
3

As @Darin-Dimitrov said, I suspect that your Parallel.For isn't actually spawning the same number of threads in each of the two cases. Try manually creating N threads to ensure that you are actually seeing thread contention in both cases.
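A sketch of that suggestion (mine, with invented names): start a fixed number of raw threads and release them simultaneously so every thread really is contending on the collection at once:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

class ManualThreadsSketch
{
    static void Main()
    {
        const int threads = 8, perThread = 100000;
        var bag = new ConcurrentBag<int>();
        var start = new ManualResetEventSlim();
        var workers = new Thread[threads];

        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                start.Wait(); // block until every thread is ready
                for (int i = 0; i < perThread; i++) bag.Add(i);
            });
            workers[t].Start();
        }

        var sw = Stopwatch.StartNew();
        start.Set(); // release all threads at the same instant
        foreach (var w in workers) w.Join();
        sw.Stop();

        Console.WriteLine("{0} items in {1} ms", bag.Count, sw.ElapsedMilliseconds);
    }
}
```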

Chris Shain
    They do use the same # of threads. To verify, replace `lst1.Add(i);` with `lst1.Add(ThreadId);` and do a `.Distinct()` on the result. – H H Jan 24 '11 at 18:58
  • I had thought about this possibility but hadn't yet tested it (except giving it a cursory glance with a debugger). I tried specifically starting 40 Tasks for each case. The ConcurrentBag was faster for up to a million items, but its speed slows down for anything larger. Its speed was actually reasonable. – Tachy Jan 24 '11 at 19:29
1

You basically have very few concurrent writes and no contention (Parallel.For doesn't necessarily mean many threads). Try parallelizing the writes and you will observe different results:

class Program
{
    private static object list1_lock = new object();
    private const int collSize = 1000;

    static void Main()
    {
        ConcurrentBagTest();
        LockCollTest();
    }

    private static void ConcurrentBagTest()
    {
        var bag1 = new ConcurrentBag<int>();
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            Thread.Sleep(5);
            bag1.Add(x);
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
    }

    private static void LockCollTest()
    {
        var lst1 = new List<int>(collSize);
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            lock (list1_lock)
            {
                Thread.Sleep(5);
                lst1.Add(x);
            }
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
    }
}
Darin Dimitrov
  • It's such a fine-grained operation that creating a task for every add is going to add too much overhead. – Tim Lloyd Jan 24 '11 at 18:41
  • @Darin Dimitrov, how many cores do you have? Same question for @chibacity – Bengie Jan 24 '11 at 18:42
  • @Darin I have 8 cores. It is inversely scaling. – Tim Lloyd Jan 24 '11 at 18:42
  • @chibacity, yes, that's why it is better to reduce the number of adds and slow the operation down just a little with a `Thread.Sleep(5);` to create contention. And I have two cores. I will update my answer to include this. – Darin Dimitrov Jan 24 '11 at 18:42
  • @Darin You have a missing zero off your collSize mate... :) – Tim Lloyd Jan 24 '11 at 18:45
  • @chibacity, I know I have a missing zero and I did it on purpose. Please see my update. The concurrent bag is almost 4 times faster for me. – Darin Dimitrov Jan 24 '11 at 18:46
    @Darin Your new example is now dominated by 5ms waits - that is obviously going to skew the results in a non-meaningful way? 5ms is an eternity. – Tim Lloyd Jan 24 '11 at 18:48
  • @chibacity, yes, but this is the case for the lock also. Similar circumstances. I just simulate contention, which is what makes the concurrent collections faster than Monitor.Enter, and it is closer to the real world. In the real world the threads are contended. – Darin Dimitrov Jan 24 '11 at 18:49
    @Darin Yes threads are contended, but 5ms is forever. If we extrapolate this, I can take any similar algorithm, put in 10 year waits and say the results are comparable. – Tim Lloyd Jan 24 '11 at 18:50
  • @chibacity, the results are not comparable: the concurrent collection performs almost 4 times faster than the lock. No matter whether you multiply it by 5ms or 10 years. Did you run my updated example? – Darin Dimitrov Jan 24 '11 at 18:51
    @Darin That is because you have placed the 5ms wait inside the lock for the list. It is not a comparable test. You need to place the wait outside the lock for it to be comparable. – Tim Lloyd Jan 24 '11 at 18:52
  • @chibacity, can't you see that I did the same for the concurrent bag? Placing the sleep outside the lock doesn't create contention which is what I was trying to simulate in the first place. – Darin Dimitrov Jan 24 '11 at 18:53
    @Darin You need to place the wait outside the lock for it to be comparable. – Tim Lloyd Jan 24 '11 at 18:53
  • @chibacity, I know but I am trying to simulate contention. As you can see in the case of the concurrent collection the sleep is inside. All I am trying to say is that concurrent bags perform better when there is contention and which is what explains the OPs observed results. – Darin Dimitrov Jan 24 '11 at 18:57
  • @Darin: No, @Chiba is right. You cannot get 'inside' the locking section of the collection so this is skewing it. – H H Jan 24 '11 at 19:01
  • @Darin I catch your drift, but you are not simulating the contention in a way that is comparable between tests. For it to be comparable the wait needs to be outside the lock. And 5ms is still 'forever', even 1ms is. – Tim Lloyd Jan 24 '11 at 19:03
  • @Henk, @chibacity, I get your point, but the thing is that the concurrent bag rarely uses any lock at all, it uses Interlocked.Exchange to swap object references in memory which is much faster. Look at the `AddInternal` method implementation with reflector. It basically only locks twice and that's the power of those collections => they are lock free. – Darin Dimitrov Jan 24 '11 at 19:05
  • @Darin If you want to model other threads being busy, put the wait outside the lock, and better still use spin waits, as Thread.Sleep(milliseconds) de-schedules threads, which is just too long and too brutal. I am assuming you want to model threads being busy doing other stuff. Spin locks\waits are more accurate for modelling this. – Tim Lloyd Jan 24 '11 at 19:07
  • @Darin: `lock()` uses `Monitor` uses `Interlocked`. Still no explanation. – H H Jan 24 '11 at 19:07
  • @Henk, Interlocked.Exchange is much faster than Monitor.Enter which is what a lock is. – Darin Dimitrov Jan 24 '11 at 19:08
  • @chibacity: I used `Thread.SpinWait()` instead of `.Sleep()` on my quad core (100% utilization on all cores), and it seems `ConcurrentBag` scales better: http://i55.tinypic.com/16h6ec0.png – Allon Guralnek Jan 24 '11 at 19:15
    @Allon I hope you put the SpinWait outside the lock or I'm going to cry :( – Tim Lloyd Jan 24 '11 at 19:17
  • @Allon, I confirm similar performance of both algorithms with a SpinWait outside the lock. I guess we need @Jon Skeet here :-) – Darin Dimitrov Jan 24 '11 at 19:19
  • @chibacity: No, I didn't notice the conversation since my comment view was contracted. Putting the `SpinWait` outside the lock shows the same performance for both. – Allon Guralnek Jan 24 '11 at 19:23
    @Darin: Or to actually attempt to generate usage as described by [MSDN](http://msdn.microsoft.com/en-us/library/dd381779.aspx): `ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.` – Allon Guralnek Jan 24 '11 at 19:28
  • @Allon That's great. The previous problem was that a Thread.Sleep was placed inside the lock. A wait inside the lock causes a test run time of roughly "(number of tasks * wait time) + execution time" . So if we have a wait of 1 year and 1000 tasks it's going to take 1000 years because each task has to wait for the sleep time. However the concurrent bag test did not suffer from such a problem. If the wait time has been 1 year, the test would have taken 1 year as all tasks wait for the wait time, then *all* attempt to execute concurrently. – Tim Lloyd Jan 24 '11 at 19:29
1

My guess is that the locks don't experience much contention. I would recommend reading the following article: Java theory and practice: Anatomy of a flawed microbenchmark. It discusses a lock microbenchmark, and as it shows, there are a lot of things to take into consideration in this kind of situation.
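In that spirit, a small harness sketch (my own, with hypothetical names) that avoids two pitfalls of the kind the article describes: timing the first run, which includes JIT compilation, and trusting a single measurement:

```csharp
using System;
using System.Diagnostics;

class WarmupSketch
{
    // Run the benchmark body once untimed (JIT + allocator warm-up),
    // then take the median of several timed runs instead of one.
    static double Measure(Action body, int runs = 5)
    {
        body(); // warm-up run, not timed
        var times = new double[runs];
        for (int r = 0; r < runs; r++)
        {
            var sw = Stopwatch.StartNew();
            body();
            sw.Stop();
            times[r] = sw.Elapsed.TotalMilliseconds;
        }
        Array.Sort(times);
        return times[runs / 2]; // median is robust to outlier runs
    }

    static void Main()
    {
        long sum = 0;
        double median = Measure(() =>
        {
            for (int i = 0; i < 1000000; i++) sum += i;
        });
        Console.WriteLine("median: {0} ms", median);
    }
}
```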

Edin Dazdarevic
0

It would be interesting to see scaling between the two of them.

Two questions

1) how fast is bag vs list for reading, remember to put a lock on the list

2) how fast is bag vs list for reading while another thread is writing

Bengie
0

Because the loop body is small, you could try using the Partitioner class's Create method, which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition instead of once per iteration:

How to: Speed Up Small Loop Bodies
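A sketch of what the linked article describes (my example, not Craig's): Partitioner.Create(0, n) hands each worker a contiguous range, so the tiny body runs in a plain sequential for-loop per partition rather than paying per-iteration delegate overhead:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionerSketch
{
    static void Main()
    {
        const int n = 1000000;
        var results = new double[n];

        // Each partition is a Tuple<int,int> half-open range; the
        // delegate runs once per range, not once per index.
        Parallel.ForEach(Partitioner.Create(0, n), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                results[i] = Math.Sqrt(i);
        });

        Console.WriteLine("results[{0}] = {1}", n - 1, results[n - 1]);
    }
}
```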

Craig
0

It appears that ConcurrentBag is just slower than the other concurrent collections.

I think it's an implementation problem: ANTS Profiler shows that it gets bogged down in a couple of places, including an array copy.

Using ConcurrentDictionary is thousands of times faster.
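For completeness, a sketch (mine, not the answerer's) of the ConcurrentDictionary alternative for a write-only collection phase; the dictionary serves as a concurrent set, with a throwaway byte as the value:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class DictAlternativeSketch
{
    static void Main()
    {
        // If items only need to be collected (bag semantics aren't
        // required), a ConcurrentDictionary keyed by the value works
        // as a drop-in concurrent store.
        var dict = new ConcurrentDictionary<int, byte>();

        Parallel.For(0, 100000, i => dict.TryAdd(i, 0));

        Console.WriteLine("stored {0} items", dict.Count);
    }
}
```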

Jason Hernandez