0

This is my first attempt at parallel programming.

I'm writing a test console app before using this in my real app and I can't seem to get it right. When I run this, the parallel search is always faster than the sequential one, but the parallel search never finds the correct value. What am I doing wrong?

I tried it without using a partitioner (just Parallel.For); it was slower than the sequential loop and gave the wrong number. I saw a Microsoft doc that said for simple computations, using Partitioner.Create can speed things up. So I tried that but still got the wrong values. Then I saw Interlocked, but I think I'm using it wrong.

Any help would be greatly appreciated

Random r = new Random();
Stopwatch timer = new Stopwatch();

do {
    // Make and populate a list
    List<short> test = new List<short>();
    for (int x = 0; x <= 10000000; x++)
    {
        test.Add((short)(r.Next(short.MaxValue) * r.NextDouble()));
    }

    // Initialize result variables
    short rMin = short.MaxValue;
    short rMax = 0;

    // Do min/max normal search
    timer.Start();
    foreach (var amp in test)
    {
        rMin = Math.Min(rMin, amp);
        rMax = Math.Max(rMax, amp);
    }
    timer.Stop();

    // Display results
    Console.WriteLine($"rMin: {rMin}  rMax: {rMax}  Time: {timer.ElapsedMilliseconds}");

    // Initialize parallel result variables
    short pMin = short.MaxValue;
    short pMax = 0;

    // Create list partioner
    var rangePortioner = Partitioner.Create(0, test.Count);

    // Do min/max parallel search
    timer.Restart();
    Parallel.ForEach(rangePortioner, (range, loop) =>
    {
        short min = short.MaxValue;
        short max = 0;

        for (int i = range.Item1; i < range.Item2; i++)
        {
            min = Math.Min(min, test[i]);
            max = Math.Max(max, test[i]);
        }
        _ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMin), Math.Min(pMin, min));
        _ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMax), Math.Max(pMax, max));

    });
    timer.Stop();

    // Display results
    Console.WriteLine($"pMin: {pMin}  pMax: {pMax}  Time: {timer.ElapsedMilliseconds}");


    Console.WriteLine("Press enter to run again; any other key to quit");
} while (Console.ReadKey().Key == ConsoleKey.Enter);

Sample output:

rMin: 0  rMax: 32746  Time: 106
pMin: 0  pMax: 32679  Time: 66
Press enter to run again; any other key to quit
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
master_ruko
  • 629
  • 1
  • 4
  • 22

3 Answers3

1

Interlocked.Exchange is thread safe only for Exchange, every Math.Min and Math.Max can be with race condition. You should compute min/max for every batch separately and then join results.

Leszek Mazur
  • 2,443
  • 1
  • 14
  • 28
  • I thought that was what I was doing with having a min/max inside the parallel.ForEach to find the chunk min/max and then doing the Exchange. Or do you mean make a list of chunk min/max's and then a final min/max on the list outside the Parallel.ForEach? – master_ruko Nov 12 '21 at 20:22
  • I tried this and I'm getting the correct values now. One interesting thing I noticed is that the first time through the main do while loop, both min/max loops take about the same time to execute but on subsequent passes, the parallel is about 4x faster. Any idea why this is? Most often the I'll only need to run the min/max once per data set. Is there something I can change to make it faster all the time or should I just not bother with parallel for this task. I only wanted to try it because I'm searching upto 30M data points. – master_ruko Nov 12 '21 at 20:46
  • @master_ruko the JITter must compile the MSIL to native code before running it, so it is expected that you'll get a performance hit the first time you invoke a method of a chunky library like the TPL. [link](https://stackoverflow.com/questions/28950581/c-sharp-takes-more-time-to-do-first-execution), [link](https://stackoverflow.com/questions/20144213/how-to-compile-with-ngen-exe-and-how-to-run-the-native-code-that-is-generated), [link](https://stackoverflow.com/questions/9583460/in-c-on-startup-is-there-a-way-to-force-the-jit-compiler-to-touch-all-msil-co) – Theodor Zoulias Nov 12 '21 at 22:05
  • @TheodorZoulias Thank you, I didn't think about that. – master_ruko Nov 12 '21 at 22:23
1

Using low-lock techniques like the Interlocked class is tricky and advanced. Taking into consideration that your experience in multithreading is not excessive, I would say go with a simple and trusty lock:

object locker = new object();

//...

lock (locker)
{
    pMin = Math.Min(pMin, min);
    pMax = Math.Max(pMax, max);
}
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
  • I tried this and I am getting the correct values, but now the parallel is slower than the sequential. – master_ruko Nov 12 '21 at 20:16
  • I haven't tested it, but It shouldn't be slower. The lock-protected block should only be executed once per range, and the partitioner should create only a handful of ranges. – Theodor Zoulias Nov 12 '21 at 21:40
1

The correct way to do a parallel search like this is to compute local values for each thread used, and then merge the values at the end. This ensures that synchronization is only needed at the final phase:

var items = Enumerable.Range(0, 10000).ToList();

int globalMin = int.MaxValue;
int globalMax = int.MinValue;
Parallel.ForEach<int, (int Min, int Max)>(
    items, 
    () => (int.MaxValue, int.MinValue), // Create new min/max values for each thread used
    (item, state, localMinMax) =>
{
    var localMin = Math.Min(item, localMinMax.Min);
    var localMax = Math.Max(item, localMinMax.Max);
    return (localMin, localMax); // return the new min/max values for this thread
},
localMinMax => // called one last time for each thread used
{
    lock(items) // Since this may run concurrently, synchronization is needed
    {
        globalMin = Math.Min(globalMin, localMinMax.Min);
        globalMax = Math.Max(globalMax, localMinMax.Max);
    }
});

As you can see this is quite a bit more complex than a regular loop, and this is not even doing anything fancy like partitioning. An optimized solution would work over larger blocks to reduce overhead, but this is omitted for simplicity, and it looks like the OP is aware such issues already.

Be aware that multi threaded programming is difficult. While it is a great idea to try out such techniques in a playground rather than a real program, I would still suggest that you should start by studying the potential dangers of thread safety, there is fairly easy to find good resources about this.

Not all problems will be as obviously wrong like this, and it is quite easy to cause issues that breaks once in a million, or only when the cpu load is high, or only on single CPU systems, or issues that are only detected long after the code is put into production. It is a good practice to be paranoid whenever multiple threads may read and write the same memory concurrently.

I would also recommend learning about immutable data types, and pure functions, since these are much safer and easier to reason about once multiple threads are involved.

JonasH
  • 28,608
  • 2
  • 10
  • 23
  • Thank you! I will look into what you mentioned. The code worked perfectly, giving the correct value and is consistently ~3x faster – master_ruko Nov 12 '21 at 21:14
  • I almost had this once. I just made the mistake of trying to pass ```(min, max}``` to the ForEach body lambda function instead of localMinMax. :face-palm: – master_ruko Nov 12 '21 at 21:16
  • 1
    This is a good solution (+1). I expect though that using the `Partitioner` to create ranges should be more efficient, because it eliminates the overhead of invoking a lambda for each element (lambdas [are not inlinable](https://stackoverflow.com/questions/1529390/do-lambdas-get-inlined)). And the calculation of this specific parallel loop is so lightweight, that this overhead might not be negligible. – Theodor Zoulias Nov 12 '21 at 21:48