
I have an application that uses parallelization to process data.

The main program is in C#, while one of the routines for analyzing the data lives in an external C++ DLL. This library scans the data and invokes a callback every time a certain signal is found within it. The signals must be collected, sorted and then stored to disk.

Here is my first simple implementation of the method invoked by the callback and of the method for sorting and storing data:

// collection where the found signals are saved
List<MySignal> mySignalList = new List<MySignal>();

// method invoked by the callback
private void Collect(int type, long time)
{
    lock(locker) { mySignalList.Add(new MySignal(type, time)); }
}

// store signals to disk
private void Store()
{
    // sort the signals
    mySignalList.Sort();
    // file is an object that manages writing data to a FileStream
    file.Write(mySignalList.ToArray());
}
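
For reference, Sort() requires MySignal to be comparable; here is a minimal sketch of the class, assuming signals are ordered by their timestamp (the actual ordering criterion may differ):

```csharp
// Minimal sketch of MySignal; ordering by timestamp is an assumption.
public class MySignal : IComparable<MySignal>
{
    public int Type { get; private set; }
    public long Time { get; private set; }

    public MySignal(int type, long time)
    {
        Type = type;
        Time = time;
    }

    // List<MySignal>.Sort() uses this to order the collected signals.
    public int CompareTo(MySignal other)
    {
        return Time.CompareTo(other.Time);
    }
}
```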

The data consists of a two-dimensional array (short[][] data) of size 10000 x n, with n variable. I use parallelization in this way:

Parallel.For(0, 10000, (int i) =>
{
    // wrapper for the external C++ dll
    ProcessData(data[i]);
});

Now, for each of the 10000 arrays I estimate that 0 to 4 callbacks can be fired. I'm facing a bottleneck, and given that my CPU resources are not over-utilized, I suppose that the lock (together with thousands of callbacks) is the problem (am I right, or could there be something else?). I've tried the ConcurrentBag collection, but performance is even worse (in line with other users' findings).

I thought that a possible way to use lock-free code would be to have multiple collections. Then I would need a strategy to make each thread of the parallel loop work on a single collection. The collections could, for instance, be stored in a dictionary keyed by thread ID, but I don't know of any .NET facility for this (I would need to know the thread IDs to initialize the dictionary before launching the parallelization). Is this idea feasible, and if so, does .NET provide a tool for it? Alternatively, any other idea to speed up the process?

[EDIT] I've followed Reed Copsey's suggestion and used the following solution (according to the VS2010 profiler, locking and adding to the list previously took 15% of the resources, while now it takes only 1%):

// master collection where the found signals are saved
List<MySignal> mySignalList = new List<MySignal>();
// thread-local storage of data (each thread is working on its List<MySignal>)
ThreadLocal<List<MySignal>> threadLocal;

// analyze data
private void AnalizeData()
{
    using(threadLocal = new ThreadLocal<List<MySignal>>(() => 
        { return new List<MySignal>(); }))
    {
        Parallel.For<int>(0, 10000,
        () =>
        { return 0;},
        (i, loopState, localState) =>
        {
            // wrapper for the external c++ dll
            ProcessData(data[i]);
            return 0;
        },
        (localState) =>
        {
            lock(this)
            {
                // add thread-local lists to the master collection
                // add the thread-local list to the master collection
                mySignalList.AddRange(threadLocal.Value);
                threadLocal.Value.Clear();
            }
        });
    }
}

// method invoked by the callback
private void Collect(int type, long time)
{
    threadLocal.Value.Add(new MySignal(type, time));
}
Mauro Ganswer

4 Answers


I thought that a possible way to use lock-free code would be to have multiple collections. Then I would need a strategy to make each thread of the parallel loop work on a single collection. The collections could, for instance, be stored in a dictionary keyed by thread ID, but I don't know of any .NET facility for this (I would need to know the thread IDs to initialize the dictionary before launching the parallelization). Is this idea feasible, and if so, does .NET provide a tool for it? Alternatively, any other idea to speed up the process?

You might want to look at using ThreadLocal<T> to hold your collections. This automatically allocates a separate collection per thread.
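
For example, a minimal sketch using the MySignal type and callback signature from the question (the merge step at the end is still up to you):

```csharp
// One List<MySignal> per thread, created lazily by the factory
// delegate on the first access from each thread.
private ThreadLocal<List<MySignal>> threadLocal =
    new ThreadLocal<List<MySignal>>(() => new List<MySignal>());

// The callback can now add without taking any lock, since no two
// threads ever touch the same list.
private void Collect(int type, long time)
{
    threadLocal.Value.Add(new MySignal(type, time));
}
```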

That being said, there are overloads of Parallel.For which work with local state and have a combination pass at the end. This could allow you to spawn your ProcessData wrapper so that each loop body works on its own collection, and then recombine at the end. That would eliminate the need for locking (since each thread works on its own data set) until the recombination phase, which happens once per thread instead of once per task (i.e. 10,000 times). This could reduce the number of locks you're taking from ~25,000 (0-4 × 10000) down to a few (system- and algorithm-dependent, but on a quad-core system probably around 10, in my experience).

For details, see my blog post on aggregating data with Parallel.For/ForEach. It demonstrates the overloads and explains how they work in more detail.
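
As a sketch of the shape of that overload (names taken from the question; for the callback scenario the local list would still have to be exposed to Collect somehow, e.g. via a ThreadLocal field, so here the body is shown adding directly):

```csharp
List<MySignal> masterList = new List<MySignal>();
object locker = new object();

Parallel.For(0, 10000,
    // localInit: runs once per worker thread, creates its private list
    () => new List<MySignal>(),
    // body: runs once per index, returns the (possibly updated) local state
    (i, loopState, localList) =>
    {
        ProcessData(data[i]);   // the callback would fill localList
        return localList;
    },
    // localFinally: runs once per worker thread; the only locked section
    localList =>
    {
        lock (locker) { masterList.AddRange(localList); }
    });
```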

Reed Copsey
  • This sounds like the solution I was searching for. I was trying it but I'm not sure how to do it. Should I use Parallel.For<List<MySignal>>(...) so that the collection is initialized in the localInit and then passed to the localFinally delegate? The problem is that the body will never use the collection and has no return type (the callback will cause the collection to be filled up). Maybe you were suggesting mixing this method with the ThreadLocal use, but in this case I'm missing how to do it. Could you give me some more details? – Mauro Ganswer Feb 21 '11 at 20:28

You don't say how much of a "bottleneck" you're encountering. But let's look at the locks.

On my machine (quad core, 2.4 GHz), a lock costs about 70 nanoseconds if it's not contended. I don't know how long it takes to add an item to a list, but I can't imagine that it takes more than a few microseconds. But let's say it takes 100 microseconds (I would be very surprised to find that it's even 10 microseconds) to add an item to the list, taking into account lock contention. So if you're adding 40,000 items to the list, that's 4,000,000 microseconds, or 4 seconds. And I would expect one core to be pegged if this were the case.

I haven't used ConcurrentBag, but I've found the performance of BlockingCollection to be very good.
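
For illustration, a minimal BlockingCollection sketch (using MySignal from the question; a single consumer drains the queue while the parallel loop produces):

```csharp
var queue = new BlockingCollection<MySignal>();

// Consumer: drains the queue into a list until CompleteAdding is called.
var consumer = Task.Factory.StartNew(() =>
{
    var results = new List<MySignal>();
    foreach (var signal in queue.GetConsumingEnumerable())
        results.Add(signal);
    return results;
});

// The callback would become a producer (Add is internally synchronized):
//   queue.Add(new MySignal(type, time));

queue.CompleteAdding();          // after the Parallel.For finishes
var collected = consumer.Result; // blocks until the consumer drains
collected.Sort();                // then sort and write to disk as before
```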

I suspect, though, that your bottleneck is somewhere else. Have you done any profiling?

Jim Mischel
  • Well, it's a real-time application that does other processing in the meantime, but let me try to explain why I came to that conclusion. I have an 8-core 2.5 GHz machine and I profiled the application with the Visual Studio 2010 concurrency [profiler](http://msdn.microsoft.com/en-us/magazine/ee336027.aspx). In the "CPU Utilization" graph it normally sits at about 70%, and I can see idle capacity on the logical cores that could be used. In the "Threads" graph I see that there are many threads (more than 8) that spend most of their time in synchronization and little in executing the Collect method. – Mauro Ganswer Feb 21 '11 at 18:58
  • @Ganswer: Is something else locking your signals list? I assume you're going to process them at some point, meaning that you have to lock the list to get the signals out? – Jim Mischel Feb 21 '11 at 19:17
  • No, I do not use other locks. When the Parallel.For (which blocks further code execution until all its work finishes) completes, I process the data with the Store method – Mauro Ganswer Feb 21 '11 at 19:30

The basic collections in C# aren't thread safe.

The problem you're having is due to the fact that you're locking the entire collection just to call an add() method.

You could create a thread-safe collection that only locks single elements inside the collection, instead of the whole collection.

Let's look at a linked list, for example.

Implement an add(item (or list)) method that does the following:

  1. Lock the collection.
  2. A = get the last item.
  3. Set the last-item reference to the new item (or the last item in the new list).
  4. Lock the last item (A).
  5. Unlock the collection.
  6. Add the new item/list to the end of A.
  7. Unlock the locked item.

This way, the whole collection is locked for only three quick steps during an add.

Then, when iterating over the list, just do a trylock() on each object. If it's locked, wait for the lock to be freed (that way you're sure the add() has finished).
In C# you can use an empty lock() block on the object as a trylock(). So now you can add safely and still iterate over the list at the same time.

Similar solutions can be implemented for the other commands if needed.
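
A rough sketch of the add() steps above, with a hand-rolled node type (illustrative only, not production-ready):

```csharp
class Node
{
    public MySignal Value;
    public Node Next;
}

class NodeLockedList
{
    private readonly object listLock = new object();
    private Node head, tail;

    public void Add(MySignal item)
    {
        var node = new Node { Value = item };
        Node prevTail;
        lock (listLock)                   // 1. lock the collection
        {
            prevTail = tail;              // 2. A = get the last item
            tail = node;                  // 3. new item becomes the last
            if (head == null) head = node;
            if (prevTail != null)
                Monitor.Enter(prevTail);  // 4. lock A before releasing
        }                                 // 5. unlock the collection
        if (prevTail != null)
        {
            prevTail.Next = node;         // 6. link A to the new item
            Monitor.Exit(prevTail);       // 7. unlock A
        }
    }
}
```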

Yochai Timmer

Any built-in solution for a collection is going to involve some locking. There may be ways to avoid it, perhaps by segregating the actual data constructs being read/written, but you're going to have to lock SOMEWHERE.

Also, understand that Parallel.For() will use the thread pool. While simple to implement, you lose fine-grained control over creation/destruction of threads, and the thread pool involves some serious overhead when starting up a big parallel task.

From a conceptual standpoint, I would try two things in tandem to speed up this algorithm:

  • Create threads yourself, using the Thread class. This frees you from the scheduling slowdowns of the thread pool; a thread starts processing (or waiting for CPU time) when you tell it to start, instead of the thread pool feeding requests for threads into its internal workings at its own pace. You should be aware of the number of threads you have going at once; the rule of thumb is that the benefits of multithreading are overcome by the overhead when you have more than twice the number of active threads as "execution units" available to execute threads. However, you should be able to architect a system that takes this into account relatively simply.
  • Segregate the collection of results, by creating a dictionary of collections of results. Each results collection is keyed to some token carried by the thread doing the processing and passed to the callback. The dictionary can have multiple elements READ at one time without locking, and as each thread is WRITING to a different collection within the Dictionary there shouldn't be a need to lock those lists (and even if you did lock them you wouldn't be blocking other threads). The result is that the only collection that has to be locked such that it would block threads is the main dictionary, when a new collection for a new thread is added to it. That shouldn't have to happen often if you're smart about recycling tokens.
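
One way to sketch the second point with ConcurrentDictionary, keyed by the managed thread ID so that no pre-initialization is needed (names are illustrative, MySignal from the question):

```csharp
var perThread = new ConcurrentDictionary<int, List<MySignal>>();

// Callback: each thread reads/writes only its own list. GetOrAdd
// creates the sub-collection lazily the first time a thread appears.
void Collect(int type, long time)
{
    var list = perThread.GetOrAdd(
        Thread.CurrentThread.ManagedThreadId,
        _ => new List<MySignal>());
    list.Add(new MySignal(type, time));
}

// After all threads have finished: merge without contention.
var merged = new List<MySignal>();
foreach (var list in perThread.Values)
    merged.AddRange(list);
merged.Sort();
```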
KeithS
  • Thanks for your reply! Yes, I was thinking of something similar, but I wanted to avoid passing a token identifying the thread to the callback, since the same callback is used in other contexts and it would lose its reusability. For this reason I was thinking of using the unique thread ID (which I would need to know before launching the threads) and then accessing the collection inside the dictionary via Thread.CurrentThread.ManagedThreadId. – Mauro Ganswer Feb 21 '11 at 19:22
  • That should work just as well; however, you'll be creating more sub-collections, because you'll be creating 10k total threads over the course of the algorithm (even if you throttle it back to X threads running concurrently); if you use the managed thread ID, each one will be different and require a different sub-collection keyed to it. – KeithS Feb 21 '11 at 19:48
  • " understand that Parallel.For() will use the thread pool" - That's not necessary true. You can have Parallel.For use a custom task scheduler, which may or may not use the threadpool... – Reed Copsey Feb 21 '11 at 20:43