
I'm trying to speed up an algorithm that I have written in C#. One of the first things I thought of was to make it parallel.

The algorithm has to run over a lot (~millions) of 2D segments, and each segment is independent of the others.

Here is the code:

    private void DoMapping(Segment[] image, CancellationToken ct, int numTasks = 3)   
    {
        long time = Environment.TickCount;
        LaserOutput = new List<Vector3[]>();
        NormalsOutput = new List<Vector3>();
        Task<Tuple<List<Vector3[]>, List<Vector3>>>[] tasks = new Task<Tuple<List<Vector3[]>, List<Vector3>>>[numTasks];

        int perTaskSegments = image.Length / numTasks;

        for (int taskIndex = 0; taskIndex < tasks.Length; taskIndex++)
        {

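            // The last task also picks up the remainder segments (image.Length % tasks.Length).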
            int nseg = perTaskSegments * (taskIndex + 1) + (taskIndex == tasks.Length - 1 ? image.Length % tasks.Length : 0);
            int from = perTaskSegments * taskIndex;
            Tuple<int, int, Segment[], CancellationToken> obj = new Tuple<int, int, Segment[], CancellationToken>(from, nseg, image, ct);
            tasks[taskIndex] = Task.Factory.StartNew(DoComputationsAction, obj, CancellationToken.None, TaskCreationOptions.LongRunning, TaskScheduler.Default);
        }

        Task.WaitAll(tasks);

        for (int taskIndex = 0; taskIndex < tasks.Length; taskIndex++)
        {
            LaserOutput.AddRange(tasks[taskIndex].Result.Item1);
            NormalsOutput.AddRange(tasks[taskIndex].Result.Item2);
        }
    }

    private Tuple<List<Vector3[]>, List<Vector3>> DoComputationsAction(object obj)
    {
        Tuple<int, int, Segment[], CancellationToken> parm = obj as Tuple<int, int, Segment[], CancellationToken>;
        List<Vector3[]> tmpLaser = new List<Vector3[]>();
        List<Vector3> tmpNormals = new List<Vector3>();

        bool errorOccurred = false;
        for (int segCounter = parm.Item1; segCounter < parm.Item2 && !errorOccurred; segCounter++)
        {
            if (parm.Item4.IsCancellationRequested)
                break;
            try
            {
                var res = SplitOverMap(parm.Item3[segCounter], (string error) => {
                    errorOccurred = true;
                    MessageBox.Show(error, "An error occurred", MessageBoxButtons.OK, MessageBoxIcon.Error);
                    Logger.Log("An error occurred while mapping data to 3d.");
                });

                if (res != null)
                {
                    tmpLaser.AddRange(res.Item1);
                    tmpNormals.AddRange(res.Item2);
                }
            }
            catch (Exception e)
            {
                Logger.Log("An error occurred while calculating 3d map. Skipping polyline. " + e.Message);
            }
        }

        return new Tuple<List<Vector3[]>, List<Vector3>>(tmpLaser, tmpNormals);
    }

Inside SplitOverMap, a query to a spatial data structure (QTree) is performed, and then some computations occur.

No locks are taken during the whole process, and no disk I/O is performed.

Do you have any suggestions on what could be causing the CPU to reach only 40-60% usage?

I have also tried changing numTasks to 4, 6 and 8. There were no major changes.

I am thinking about the GC, but there is not a whole lot I can do to prevent it from running.
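
For reference, a minimal sketch (assuming .NET 4.5+) of how to check which GC flavor is active; server GC, enabled via `<gcServer enabled="true"/>` in App.config, often utilizes cores better in parallel workloads than the default workstation GC:

    using System;
    using System.Runtime;

    static class GcInfo
    {
        // Minimal sketch: only reads the current GC settings, changes nothing.
        public static void Print()
        {
            Console.WriteLine("Server GC:    " + GCSettings.IsServerGC);
            Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
        }
    }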

EDIT: By reducing the memory usage of some of the classes, I have managed to improve the CPU usage a little; it now runs at around 70%.

On the other hand, by raising the level threshold of the QuadTree I have obtained a substantial performance improvement.

deight
  • Each single thread can only utilize a single physical core; maybe that is what you are facing. Also, `Task.WaitAll(tasks);` will only move on when all tasks are completed, so some of them might already be finished while others are still running. – Karolis Kajenas Jun 16 '17 at 05:34
  • You need to provide a [mcve]. We really should have code that we can copy paste and run to see the issue. Ideally too we should have a non-parallel version of this code so that we can see what the basic computation is. – Enigmativity Jun 16 '17 at 05:36
  • Not sure what you intention is but you can make it parallel and check. Simple way to do that would to use Parallel for loop. You can also set the processor affinity which again depends on the processor cores. – Souvik Ghosh Jun 16 '17 at 05:43
  • @Enigmativity Unfortunately I can't provide a minimal example, because the code runs in a framework and to make it work I would have to paste half the application here. Plus, I am developing this application for a company and I can't share too much of the algorithm. @Souvik the output must be ordered, so using Parallel.For would make things more complicated (but see the ordering sketch after these comments), plus there is the overhead of creating a million delegates. – deight Jun 16 '17 at 05:43
  • Then I would suggest creating asynchronous tasks and calling them within the for loop. Since you need an ordered loop, I think it is good to reduce the amount of execution time within each iteration. – Souvik Ghosh Jun 16 '17 at 05:56
  • I would also have a look at TPL Dataflow to get better scaling results, especially when you have more operations before or after this code. – Boas Enkler Jun 16 '17 at 06:46
  • A lot of DIY here; your parallelism is overcomplicated, and that makes it hard to spot any problems. The basic approach is to use the standard tools, Parallel.ForEach() or PLINQ's AsParallel(). Then, when you still have problems, they will be easier to find and resolve. – H H Jun 16 '17 at 08:35
  • @HenkHolterman Read above. Using a profiler, I have noticed that the small object heap is heavily used, while the large object heap is almost empty. My guess is that the heap is a bottleneck. – deight Jun 16 '17 at 08:47
  • A very wild guess. In what way would the heap stall the CPU? The small object heap is always heavily used, and that is usually not a problem. Find the readout for % Time in GC. – H H Jun 16 '17 at 09:37
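
Regarding the ordering concern in the comments above: PLINQ can preserve input order with AsOrdered(). A minimal sketch (SplitOverMap, Segment and Logger come from the question; everything else is assumed):

    // Hypothetical rewrite of the per-segment loop using PLINQ (needs System.Linq).
    // AsOrdered() keeps the merged results in the same order as `image`.
    var results = image
        .AsParallel()
        .AsOrdered()                // preserve segment order in the output
        .WithCancellation(ct)       // honor the CancellationToken
        .Select(seg => SplitOverMap(seg, error => Logger.Log(error)))
        .Where(res => res != null)
        .ToList();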

1 Answer

Because there are no dependencies between your segments that would require additional synchronization, I suggest having a look at the Task Parallel Library (TPL). Parallel.For or Parallel.ForEach may be interesting for you.
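
For example, a minimal sketch of the per-segment loop with Parallel.For and thread-local result lists (SplitOverMap, Segment, Vector3 and Logger come from the question; the rest is assumed):

    // Hypothetical sketch: each worker fills its own result pair, and the
    // shared lists are locked only once per worker, not once per segment.
    var laser = new List<Vector3[]>();
    var normals = new List<Vector3>();
    object mergeLock = new object();

    Parallel.For(0, image.Length,
        // localInit: every worker gets its own buffers
        () => Tuple.Create(new List<Vector3[]>(), new List<Vector3>()),
        // body: process one segment, append to the worker-local buffers
        (i, state, local) =>
        {
            var res = SplitOverMap(image[i], error => Logger.Log(error));
            if (res != null)
            {
                local.Item1.AddRange(res.Item1);
                local.Item2.AddRange(res.Item2);
            }
            return local;
        },
        // localFinally: merge each worker's buffers into the shared lists
        local =>
        {
            lock (mergeLock)
            {
                laser.AddRange(local.Item1);
                normals.AddRange(local.Item2);
            }
        });

Note that with this overload the merge order is not deterministic; if the output must stay in segment order, the chunked partitioning from the question (or PLINQ's AsOrdered) is still needed.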

To optimize your existing code, there are several options:

  • Remove TaskCreationOptions.LongRunning. It may spawn new threads, and that is time-consuming.
  • Create your own task scheduler and give the underlying threads a higher priority (a sketch follows after the update below). It can also be used when experimenting with the TPL parallel loops. Currently you are using the default thread pool, which may be used or blocked by other components.

Update: See also lowering priority of Task.Factory.StartNew thread on how to create your own task scheduler with different priorities. It works very well; I have used it in several projects. See also Stephen Toub's blog.
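
A minimal sketch of such a scheduler (thread count, priority and the class name are illustrative choices, not taken from the linked posts):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical scheduler: runs queued tasks on a fixed set of dedicated
    // threads with an elevated priority instead of the shared thread pool.
    sealed class PriorityTaskScheduler : TaskScheduler, IDisposable
    {
        private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
        private readonly Thread[] _threads;

        public PriorityTaskScheduler(int threadCount, ThreadPriority priority)
        {
            _threads = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++)
            {
                _threads[i] = new Thread(() =>
                {
                    // Each worker drains the shared queue until Dispose is called.
                    foreach (Task task in _queue.GetConsumingEnumerable())
                        TryExecuteTask(task);
                })
                { IsBackground = true, Priority = priority };
                _threads[i].Start();
            }
        }

        protected override void QueueTask(Task task) => _queue.Add(task);

        // Keep it simple: never execute a task inline on the calling thread.
        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

        protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

        public override int MaximumConcurrencyLevel => _threads.Length;

        public void Dispose() => _queue.CompleteAdding();
    }

Usage then mirrors the original call site, minus the LongRunning flag:

    var scheduler = new PriorityTaskScheduler(Environment.ProcessorCount, ThreadPriority.AboveNormal);
    tasks[taskIndex] = Task.Factory.StartNew(DoComputationsAction, obj,
        CancellationToken.None, TaskCreationOptions.None, scheduler);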

KBO