
Here is the task: an array contains arbitrary strings, and we need to count how many times each string occurs in the array. Solve the task single-threaded and multi-threaded, and compare the execution times.

For some reason the single-threaded version runs faster than the multi-threaded one: 90 ms versus 300 ms. How can I fix the multi-threaded version so that it runs faster than the single-threaded one?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

namespace ParallelDictionary
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> strs = new List<string>();
            for (int i = 0; i < 1000000; i++)
            {
                strs.Add("qqq");
            }
            for (int i = 0; i < 5000; i++)
            {
                strs.Add("aaa");
            }

            F(strs);
            ParallelF(strs);
        }

        private static void F(List<string> strs)
        {
            Dictionary<string, int> freqs = new Dictionary<string, int>();

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            for (int i = 0; i < strs.Count; i++)
            {
                if (!freqs.ContainsKey(strs[i]))
                    freqs[strs[i]] = 1;
                else
                    freqs[strs[i]]++;
            }
            stopwatch.Stop();

            Console.WriteLine("single-threaded {0} ms", stopwatch.ElapsedMilliseconds);
            foreach (var kvp in freqs)
            {
                Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
            }
        }

        private static void ParallelF(List<string> strs)
        {
            ConcurrentDictionary<string, int> freqs = new ConcurrentDictionary<string, int>();

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            Parallel.ForEach(strs, str =>
            {
                freqs.AddOrUpdate(str, 1, (key, value) => value + 1);
            });
            stopwatch.Stop();

            Console.WriteLine("multi-threaded {0} ms", stopwatch.ElapsedMilliseconds);
            foreach (var kvp in freqs)
            {
                Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
            }
        }
    }
}

Alexey
  • Well, of course it's slower. Adding an element to a ConcurrentDictionary is _not_ a costly operation; it happens rather quickly. However, for every single element you want to add, you dispatch the respective work item `freqs.AddOrUpdate(str, 1, (key, value) => value + 1)` to a background thread. Doing this comes with an additional performance cost, and that's what you are seeing: the cost of dispatching work items concurrently dominates the negligible cost of the individual work items here. –  Oct 13 '22 at 11:01
  • Parallelization only makes sense if the concurrently executed work items do _substantial work_, so that the overhead of parallelization can be amortized. The "fix" therefore is: make your individual work items do much more work than a simple and quick `freqs.AddOrUpdate(...)`. But with respect to your actual code example, the real fix is to not use parallelization in this particular scenario. –  Oct 13 '22 at 11:03
  • Also important: you don't do proper perf measurements. Doing them right is a complicated topic; just use BenchmarkDotNet if you want to measure the performance of some code with any degree of confidence (a minimal harness is sketched after these comments). –  Oct 13 '22 at 11:07
  • An additional source of performance cost in your example code: Since all your work items just add to the ConcurrentDictionary and do nothing else, there will also be a high degree of contention between concurrent work items with respect to placing elements into the ConcurrentDictionary, incurring a further perf penalty. –  Oct 13 '22 at 11:10
  • Related questions: [Why was the parallel version slower than the sequential version in this example?](https://stackoverflow.com/questions/10418493/why-was-the-parallel-version-slower-than-the-sequential-version-in-this-example) and [Parallel.ForEach slower than normal foreach](https://stackoverflow.com/questions/39540106/parallel-foreach-slower-than-normal-foreach). – Theodor Zoulias Oct 13 '22 at 11:45
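
For reference, a minimal BenchmarkDotNet harness for the two counting strategies from the question might look like the sketch below. The class and method names are illustrative, it assumes the BenchmarkDotNet NuGet package is installed, and the benchmarks return their dictionaries so the work is not dead-code-eliminated.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FrequencyBenchmarks
{
    private List<string> strs;

    [GlobalSetup]
    public void Setup()
    {
        // Same skewed input as in the question: 1,000,000 "qqq" and 5,000 "aaa".
        strs = new List<string>();
        for (int i = 0; i < 1000000; i++) strs.Add("qqq");
        for (int i = 0; i < 5000; i++) strs.Add("aaa");
    }

    [Benchmark(Baseline = true)]
    public Dictionary<string, int> SingleThreaded()
    {
        var freqs = new Dictionary<string, int>();
        foreach (var s in strs)
        {
            if (!freqs.ContainsKey(s)) freqs[s] = 1; else freqs[s]++;
        }
        return freqs;
    }

    [Benchmark]
    public ConcurrentDictionary<string, int> MultiThreaded()
    {
        var freqs = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(strs, s => freqs.AddOrUpdate(s, 1, (_, v) => v + 1));
        return freqs;
    }
}

public static class BenchmarkEntryPoint
{
    public static void Main() => BenchmarkRunner.Run<FrequencyBenchmarks>();
}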

2 Answers


It's possible to make the multi-threaded version a little faster than the single-threaded version by using a partitioner to split the data up into chunks that you process separately.

Then each chunk can be processed into a separate non-concurrent dictionary without needing any locking. Finally, at the end of each range, you can merge into a non-concurrent results dictionary (which you do have to lock).

Something like this:

private static void ParallelF(List<string> strs)
{
    Dictionary<string, int> result = new Dictionary<string, int>();

    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    object locker = new object();

    // Partitioner.Create(0, strs.Count) splits the index range into chunks,
    // and each chunk is counted into its own local, non-concurrent dictionary.
    Parallel.ForEach(Partitioner.Create(0, strs.Count), range =>
    {
        var freqs = new Dictionary<string, int>();

        for (int i = range.Item1; i < range.Item2; ++i)
        {
            if (!freqs.ContainsKey(strs[i]))
                freqs[strs[i]] = 1;
            else
                freqs[strs[i]]++;
        }

        // Merge this chunk's counts into the shared result under a lock,
        // taken once per chunk rather than once per element.
        lock (locker)
        {
            foreach (var kvp in freqs)
            {
                if (!result.ContainsKey(kvp.Key))
                {
                    result[kvp.Key] = kvp.Value;
                }
                else
                {
                    result[kvp.Key] += kvp.Value;
                }
            }
        }
    });

    stopwatch.Stop();

    Console.WriteLine("multi-threaded {0} ms", stopwatch.ElapsedMilliseconds);
    foreach (var kvp in result)
    {
        Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
    }
}

On my system that gives the following results (for a release build, .NET 6):

single-threaded 50 ms
qqq 1000000
aaa 5000
multi-threaded 26 ms
qqq 1000000
aaa 5000

It's only a little faster... whether that's worth it is for you to decide.

Matthew Watson

Here is another approach that has similarities with Matthew Watson's Partitioner-based solution, but uses a different API. It uses an advanced Parallel.ForEach overload that has the signature shown below:

/// <summary>
/// Executes a foreach (For Each in Visual Basic) operation with thread-local data
/// on an System.Collections.IEnumerable in which iterations may run in parallel,
/// loop options can be configured, and the state of the loop can be monitored and
/// manipulated.
/// </summary>
public static ParallelLoopResult ForEach<TSource, TLocal>(
    IEnumerable<TSource> source,
    ParallelOptions parallelOptions,
    Func<TLocal> localInit,
    Func<TSource, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally);

The TLocal here is a local dictionary that contains the partial results calculated by a single worker thread:

static Dictionary<string, int> GetFrequencies(List<string> source)
{
    Dictionary<string, int> frequencies = new();
    ParallelOptions options = new()
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };
    Parallel.ForEach(source, options, () => new Dictionary<string, int>(),
        (item, state, partialFrequencies) =>
    {
        ref int occurrences = ref CollectionsMarshal.GetValueRefOrAddDefault(
            partialFrequencies, item, out bool exists);
        occurrences++;
        return partialFrequencies;
    }, partialFrequencies =>
    {
        lock (frequencies)
        {
            foreach ((string item, int partialOccurrences) in partialFrequencies)
            {
                ref int occurrences = ref CollectionsMarshal.GetValueRefOrAddDefault(
                    frequencies, item, out bool exists);
                occurrences += partialOccurrences;
            }
        }
    });
    return frequencies;
}

The above code also demonstrates the use of the low-level CollectionsMarshal.GetValueRefOrAddDefault API, which allows searching and updating a dictionary with a single hashing of the key.
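
A quick before/after sketch of that API (illustrative only; requires .NET 6+ and a using for the System.Runtime.InteropServices namespace, where CollectionsMarshal lives):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var counts = new Dictionary<string, int>();
string word = "qqq";

// Classic pattern: up to two hash lookups per item (ContainsKey + indexer).
if (!counts.ContainsKey(word)) counts[word] = 1; else counts[word]++;

// GetValueRefOrAddDefault: a single lookup. A missing key is added with
// default(int) == 0, and we get a writable ref to the stored value.
ref int occurrences = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, word, out bool exists);
occurrences++;

Console.WriteLine($"{word}: {counts[word]}"); // qqq: 2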

I didn't measure it (or test it), but I expect it to be slower than Matthew Watson's solution. The reason is that the source is enumerated in a synchronized fashion. If you can handle the complexity, you could consider combining both approaches for optimal performance, as sketched below.
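
A rough sketch of such a combination (not measured; the method name GetFrequenciesCombined is made up for this example): the range Partitioner from Matthew Watson's answer feeds the localInit/localFinally overload, so each worker task folds whole index ranges into its own dictionary and takes the lock only once, in localFinally.

// Requires System.Collections.Concurrent, System.Collections.Generic,
// System.Runtime.InteropServices and System.Threading.Tasks.
static Dictionary<string, int> GetFrequenciesCombined(List<string> source)
{
    Dictionary<string, int> frequencies = new();
    Parallel.ForEach(Partitioner.Create(0, source.Count),
        () => new Dictionary<string, int>(), // localInit: one dictionary per worker task
        (range, state, local) =>
        {
            // Count a whole index range into the worker-local dictionary.
            for (int i = range.Item1; i < range.Item2; i++)
            {
                ref int occurrences = ref CollectionsMarshal.GetValueRefOrAddDefault(
                    local, source[i], out _);
                occurrences++;
            }
            return local;
        },
        local => // localFinally: merge once per worker task, not once per range
        {
            lock (frequencies)
            {
                foreach ((string item, int count) in local)
                {
                    ref int occurrences = ref CollectionsMarshal.GetValueRefOrAddDefault(
                        frequencies, item, out _);
                    occurrences += count;
                }
            }
        });
    return frequencies;
}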

Theodor Zoulias