
I have a .NET Core MVC app which does some moderately heavy probability calculations. I am running a parallel loop over a list of ~2.5 million entries. Since it is a parallel loop, I am using a concurrent bag to hold the result objects. For each iteration, I then find the entry in my concurrent bag and increment its value, essentially counting the number of times that result occurred. Here is a boiled-down example of what is happening:

// results class
public class RandResult
{
    public int id { get; set; }
    public int val { get; set; }        
}

// list of ints I iterate over
var intList = new List<int>();            
for(var i = 0; i < 2500000; i++)
{
    intList.Add(i);   
}

var bagResult = new ConcurrentBag<RandResult>()
{
    new RandResult() { id = 0, val = 0 },
    new RandResult() { id = 1, val = 0 },
    new RandResult() { id = 2, val = 0 },
    new RandResult() { id = 3, val = 0 },
    new RandResult() { id = 4, val = 0 }
};

watch.Restart();
Parallel.ForEach(intList, i =>
{
    bagResult.First(b => b.id == i % 5).val++;
});
timers.Add(watch.ElapsedMilliseconds / 1000.0);  // ~1.3 seconds

You can see the timers I placed in the code to help evaluate speed. Even with this simple calculation here, that loop takes ~1.3 seconds, almost entirely due to the overhead of the concurrent bag. Given this relative inefficiency, I'm looking for alternatives. Here is what I've tried so far:

Using a regular List<RandResult> and a lock:

// takes ~0.6sec
var _lock = new object();
Parallel.ForEach(intList, i =>
{
    lock (_lock)
    {
        listResult.First(b => b.id == i % 5).val++;
    }
 });

Using Interlocked was a bit more complicated:

// takes ~0.2sec
var dict = new Dictionary<int, int>()
{
    { 0, 0 },{ 1, 1 },{ 2, 2 },{ 3, 3 },{ 4, 4 }
};
int[] indexes = new int[5] { 0, 1, 2, 3, 4 };
int[] vals = new int[5] { 0, 0, 0, 0, 0 };
Parallel.ForEach(intList, i =>
{
    dict.TryGetValue(i % 5, out int k);    
    Interlocked.Increment(ref vals[k]);
});

This one is more complicated because the Id values won't be consecutive ints, so the Dictionary serves as a reverse lookup.
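For illustration, here is how that reverse lookup might look when the ids are not consecutive. This is only a sketch: the ids in the array are made up, and `ids[i % ids.Length]` stands in for the real calculation.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical non-consecutive result ids.
var ids = new[] { 17, 42, 99, 256, 1024 };

// Reverse lookup: id -> slot in the counts array.
var slotOf = new Dictionary<int, int>();
for (var k = 0; k < ids.Length; k++)
    slotOf[ids[k]] = k;

var counts = new int[ids.Length];
Parallel.For(0, 2500000, i =>
{
    var id = ids[i % ids.Length];   // stand-in for the real calculation
    Interlocked.Increment(ref counts[slotOf[id]]);
});
// Reading slotOf concurrently is safe because it is never modified inside the loop.
```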

The question is, are there any other options?

Note:
The actual calculation being done is certainly more complex than i % 5, but the real question here is about recording the results, so that serves for the example. Also, even in the full application there will never be more than 10 entries in the List/Bag of RandResult.

Bonus question: I'm a bit shocked that the ConcurrentBag option was so much slower than everything else. I understand there is quite a bit of overhead involved with parallelism and concurrency in general but that seems excessive. Does anyone know why it is so much slower?

nurdyguy
  • To answer your bonus question, ConcurrentBag is not optimized at all to frequently be enumerated over. It is optimized to push and pop stuff in to a pool of items. – Scott Chamberlain May 02 '17 at 21:27

2 Answers


By switching to a ConcurrentDictionary you can use its AddOrUpdate method to get efficient lookups and thread-safe increments.

var dict = new ConcurrentDictionary<int, int>();

Parallel.ForEach(intList, i =>
{
    dict.AddOrUpdate(GiveSomeInt(i), 1, (key, value) => value + 1);
});

The first time you try to access an index it will add a new value of 1; any future calls to that index will return old value + 1. If two threads try to update the value at the same time, the value-factory delegate is re-run for the slower of the two when its attempt to save fails, and it then adds 1 to the newly updated value. Note that the update delegate must return the new value, so it should be `value + 1`; a post-increment `value++` would return the old value and the counts would never advance.
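The counting behaviour is easy to verify with the same workload. A quick sketch (using Parallel.For over the same range instead of the intList):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

var dict = new ConcurrentDictionary<int, int>();
Parallel.For(0, 2500000, i =>
{
    // The delegate returns the *new* count; AddOrUpdate retries it on contention.
    dict.AddOrUpdate(i % 5, 1, (key, value) => value + 1);
});
// Each of the five keys ends up at 500000.
```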

If you wanted to pre-initialize the dictionary you could also do the following. (Note that ConcurrentDictionary has no public Add method, so the collection-initializer syntax won't compile; use the index initializer instead.)

var dict = new ConcurrentDictionary<int, int>
{
    [0] = 0, [1] = 0, [2] = 0, [3] = 0, [4] = 0
};

Parallel.ForEach(intList, i =>
{
    dict.AddOrUpdate(GiveSomeInt(i), 1, (key, value) => value + 1);
});

To answer your bonus question: ConcurrentBag is not optimized for frequent enumeration. Every time you call bagResult.GetEnumerator() (which .First(...) does behind the scenes) it has to clone the bag and generate a frozen-in-time snapshot. It is optimized for pushing and popping items into a pool. Calling .First(...) on every iteration was killing your performance.
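Since the bag's contents never change inside the loop, one way to sidestep the per-iteration snapshot (a sketch, not part of the answer above; it reuses bagResult and intList from the question) is to enumerate the bag once up front and index the results directly:

```csharp
using System.Linq;
using System.Threading.Tasks;

// One enumeration (one snapshot) before the loop, instead of 2.5M of them.
var byId = bagResult.ToDictionary(b => b.id);
Parallel.ForEach(intList, i =>
{
    var r = byId[i % 5];
    lock (r) { r.val++; }   // val++ alone would still race between threads
});
```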

Scott Chamberlain
  • My (very bad) assumption was that since there were only 5 elements in the container then the `.First(...)` wouldn't be inefficient enough to worry about. Clearly I was totally wrong. – nurdyguy May 02 '17 at 21:49
  • Concurrent bag is optimized for multi-producer/multi-consumer use. Any time you want to use an `IEnumerable` over it, it has to clone itself and make you a frozen-in-time snapshot of the bag. It is that snapshot generation which is slowing it way down. – Scott Chamberlain May 02 '17 at 21:50
  • The ConcurrentDictionary is completely unnecessary because the items are fixed in place; nothing is ever removed nor added. The increment of `value`, on the other hand, may need a lock, although in a lot of cases the CPU will perform atomic increments on it if the dictionary's memory map is aligned, which the CLR will do automatically. – John Wu May 02 '17 at 21:55
  • @JohnWu The main thing we are using the dictionary for is its pre-built `AddOrUpdate` function and fast lookups. Although the id's where 0-5 in the example the OP notes in his final example's comments that the real ids won't be consecutive numbers so you can't just index directly in to them. I have replaced `i % 5` with `GiveSomeInt(i)` to show that better. – Scott Chamberlain May 02 '17 at 21:56
  • who down voted this?? weird. Only extra I'd add to this answer is that the first example isn't thread safe – Keith Nicholas May 02 '17 at 21:59
  • @KeithNicholas The first example is thread safe: if two threads try to insert at the same time, the "slower" thread will have its insert re-tried and it will call the `Func valueFactory` function instead of using the `TValue addValue`. Two updates at the same time do the same thing; the "slower" thread has its `valueFactory` called a 2nd time with the `TValue` returned from the "faster" thread. That is why it is important to not have any observable side effects from code that is run inside the `valueFactory`. – Scott Chamberlain May 02 '17 at 22:02
  • I mean of the OPs first example. – Keith Nicholas May 02 '17 at 22:05
  • In the `_lock` example in the question, if I changed the `.First(...)` to be purely an array index lookup, would that be thread safe? – nurdyguy May 02 '17 at 22:07

Not sure why you're using a concurrent bag. It's not like you are adding or removing items. And I don't think it solves any parallelism issues for you-- the only thing the bag gives you is thread-safe access to the bag, not thread-safe access to the RandResult items inside the bag.

If it were me, I'd use a simple dictionary, with a key of id. Or, if id is always a sequential integer, use an array. That would be much faster.

As for the concurrency issue: all you need to do is use Interlocked.Increment instead of val++. That will give you enough thread safety for this specific problem. You do not need to synchronize access to the bag/list/dictionary/array itself at all, since all threads use only read-only access with respect to that object. Depending on your platform, Interlocked.Increment may not incur much overhead at all: it compiles down to a single atomic hardware instruction when the value is aligned, which the CLR does automatically.

var results = new int[5];

var intList = new List<int>();            
for(var i = 0; i < 2500000; i++)
{
    intList.Add(i);   
}

watch.Restart();
Parallel.ForEach(intList, i =>
{
    Interlocked.Increment(ref results[i % 5]);
});
timers.Add(watch.ElapsedMilliseconds / 1000.0);

Additional performance note: Since the elements in your results list are so close together in memory, they are likely to cause CPU cache contention. Typically your CPU will use cache bursts to move small chunks of memory into L1 or L2 cache (which is separate per core); while it is cached, access to those memory locations on the main memory board will be locked out. So essentially all of your cores will lock each other out if they are working on parts of the memory that are within a certain distance (a "cache line") of each other; this can result in performance that is so bad that it is even slower than running the algorithm in series. This problem is called "false sharing."

To avoid the issue, you might want to pad the items in your result list so each one is large enough to exceed the cache-line size (which is CPU-dependent). Since the array will only contain 10 items, you could pad each with a 128-byte block of nothingness and not incur much overhead.
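The padding idea can be sketched with an explicit-layout struct. This is only a sketch: the 128-byte size is an assumption chosen to exceed typical cache-line sizes (commonly 64 bytes).

```csharp
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

// Each counter occupies its own 128-byte region, so no two
// counters ever share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(0)] public int Value;
}

var counters = new PaddedCounter[5];
Parallel.For(0, 2500000, i =>
    Interlocked.Increment(ref counters[i % 5].Value));
```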

For more on this issue, see this article.

John Wu
  • Please include an example of using a Dictionary and InterlockedIncrement at the same time without using an additional array like the OP did in his last example – Scott Chamberlain May 02 '17 at 21:39
  • I didn't realize that ConcurrentBag was only safe at a container level and not at an element level. – nurdyguy May 02 '17 at 21:47
  • @nurdyguy There is no magic with it; the object returned by `.First(b => b.id == i % 5)` has no thread-safety guarantees. It is the exact same as if you did `var tmp = bagResult.First(b => b.id == i % 5); tmp.val++;` – Scott Chamberlain May 02 '17 at 21:49
  • I'm not sure what this answer is adding over what the OP stated in his question? all you did was get rid of the dictionary lookup which wasn't really what was being asked? – Keith Nicholas May 02 '17 at 22:03
  • I'm not adding, I'm removing. Everything else on this page is way too complicated for what is a very simple problem. – John Wu May 02 '17 at 22:36