
For some time now I've been structuring my code around methods with no side effects so that I can use Parallel LINQ to speed things up. Along the way I've more than once stumbled on lazy evaluation making things worse instead of better, and I would like to know whether there are any tools to help with optimizing Parallel LINQ queries.

I ask because I recently refactored some embarrassingly parallel code by modifying some methods and peppering AsParallel in certain key places. The run time went down from 2 minutes to 45 seconds, but it was clear from the performance monitor that in some places the CPU cores were not being fully utilized. After a few false starts I forced some of the queries to execute with ToArray, and the run time went down even further, to 16 seconds. Reducing the run time felt good, but it was also slightly disconcerting, because it was not obvious where in the code queries needed to be forced with ToArray. Waiting until the last moment for a query to execute was clearly not the optimal strategy, but it was not at all clear at which points some of the subqueries needed to be forced in order to keep all the CPU cores busy.

As it is, I have no idea how to properly place ToArray or other methods that force LINQ computations to execute in order to gain maximum CPU utilization. So are there any general guidelines or tools for optimizing Parallel LINQ queries?

Here's a pseudo-code sample:

var firstQuery = someDictionary.SelectMany(FirstTransformation);
var secondQuery = firstQuery.Select(SecondTransformation);
var thirdQuery = secondQuery.Select(ThirdTransformation).Where(SomeConditionCheck);
var finalQuery = thirdQuery.Select(FinalTransformation).Where(x => x != null);

FirstTransformation, SecondTransformation, and ThirdTransformation are all CPU-bound; in terms of complexity they are a few 3x3 matrix multiplications and some if branches. SomeConditionCheck is pretty much a null check. FinalTransformation is the most CPU-intensive part of the code: it performs a whole bunch of line-plane intersections, checks polygon containment for those intersections, and then extracts the intersection that is closest to a certain point on the line.

I have no idea why the places where I put AsParallel reduced the run time as much as they did. I have now reached a local minimum in terms of run time, but I have no idea why; it was dumb luck that I stumbled on it. In case you're wondering, the places to put AsParallel are the first and last lines; putting AsParallel anywhere else only increases the run time, sometimes by up to 20 seconds. There is also a ToArray hidden on the first line.
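
Roughly, the placement looks like this (a sketch only; the real code is ~200 lines of dense code, so the exact position of each call within a line is an approximation):

// AsParallel on the first and last queries, plus the ToArray hidden on the
// first line. Everything in between stays sequential and lazy.
var firstQuery = someDictionary.AsParallel().SelectMany(FirstTransformation).ToArray();
var secondQuery = firstQuery.Select(SecondTransformation);
var thirdQuery = secondQuery.Select(ThirdTransformation).Where(SomeConditionCheck);
var finalQuery = thirdQuery.AsParallel().Select(FinalTransformation).Where(x => x != null);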

David K.
  • It's the same story for `AsParallel` as it is with non-parallel queries. Nothing happens until the query is evaluated. You have to iterate over or otherwise execute the query. – Anthony Pegram Feb 04 '12 at 05:02
  • @AnthonyPegram: I understand that. I'm not creating the queries for no reason; they are going to be used at some point in the program, but that point might not necessarily be the best place to force the computation. In fact it might even slow things down, whereas if some of the subqueries are forced along the way, the entire computation can be sped up by a lot. – David K. Feb 04 '12 at 05:04
  • Some sample code please, so we can imagine something behind the general explanation and also answer with concrete code proposals. – George Mamaladze Feb 04 '12 at 06:24
  • This sounds like you would be accidentally iterating over a sequence twice. Otherwise there'd be no real benefit from using .ToArray() – Johannes Rudolph Feb 04 '12 at 07:39
  • There is no double iteration. – David K. Feb 04 '12 at 19:33
  • Use a profiler before you start optimizing by trial and error. – Daniel Mann Feb 04 '12 at 19:36
  • @davidk01, it's possible that you are triggering cache resets. Here is an example: src.Select(f1).Select(f2).ToList() vs src.Select(f1).ToList().Select(f2).ToList(). In the first case LINQ applies f1 and f2 simultaneously, and in the second case step by step. It is possible that f1 and f2 access completely different parts of memory, constantly causing cache resets. – Valera Kolupaev Feb 04 '12 at 20:45
  • Real code would make this much easier to answer, especially if you can give both 'before' and 'after' versions with the ToArray additions. Barring that, of the 45 seconds before the addition of ToArray, how many were spent actually executing the queries? And during that time, what was the CPU utilization? – James Manning Feb 05 '12 at 06:16
  • @JamesManning: Real code will be hard to provide. I'm not sure I can distill more than 200 lines of dense code into something tangible with real before and after effects of adding `ToArray` to speed things up. I'm not sure what the overhead was before I added `ToArray`; if I knew, I wouldn't be asking the question. What is clear to me, though, is that there is definitely some overhead to nesting queries and peppering `AsParallel` without doing some preliminary analysis on the resulting sizes of the enumerables. – David K. Feb 05 '12 at 07:45
  • IMHO, you should be able to look at the 'relevant' parts of the code (the actual queries/subqueries) and create (or at least attempt) a repro of the problem showing the perf differences with and without ToArray addition. Without something like that or a version of the code you can share with others so we can repro the problem, I'm not sure how we could do anything other than provide general advice? – James Manning Feb 05 '12 at 15:19

2 Answers


There are a couple things going on here:

  1. PLINQ parallelizes counted collections more efficiently than uncounted IEnumerables. If you have an array, it divides the array length by your number of CPU cores and tasks them out evenly. But if you have an IEnumerable with an unknown length, it does a goofy exponential ramp-up type of thing, where tasks will process 1, 2, 4, 8, etc. elements at a time until it hits the end of the IEnumerable. (The sketch after this list shows the difference in isolation.)
  2. By parallelizing all your queries, you're breaking up work into tiny chunks. If you have M parallel queries across N elements, you end up with M*N tasks. There's more thread overhead in this than if you just parallelize the last query, in which case you'd end up with just N tasks.
  3. PLINQ does best when each task takes roughly the same amount of time to process; that way it can divide them up evenly among the cores. Because each of your queries has different performance behavior, parallelizing all of them gives you M*N tasks that take varying amounts of time, and PLINQ is not able to schedule them optimally (it doesn't know ahead of time how long each one might take).
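
Point 1 is easy to demonstrate. Here is a minimal sketch (my own example, not part of the question's code; the Work function and the element count are made up) that runs the same CPU-bound work over a counted array and over the same data hidden behind a plain IEnumerable. On a multi-core machine the counted version is typically partitioned more evenly and finishes faster:

using System;
using System.Diagnostics;
using System.Linq;

class PartitioningDemo
{
    // Some arbitrary CPU-bound work per element.
    static double Work(int x)
    {
        double acc = x;
        for (int i = 0; i < 500; i++)
            acc = Math.Cos(acc);
        return acc;
    }

    static void Main()
    {
        int[] counted = Enumerable.Range(0, 1000000).ToArray();
        // Wrapping the array in an iterator hides its length, so PLINQ
        // falls back to chunk partitioning instead of range partitioning.
        var uncounted = counted.Select(x => x);

        var sw = Stopwatch.StartNew();
        counted.AsParallel().Select(Work).ToArray();
        Console.WriteLine("counted source:   {0:F2}s", sw.Elapsed.TotalSeconds);

        sw.Restart();
        uncounted.AsParallel().Select(Work).ToArray();
        Console.WriteLine("uncounted source: {0:F2}s", sw.Elapsed.TotalSeconds);
    }
}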

So the overall guideline here is: make sure that before you start you've got an array if possible, and only put AsParallel on the very last query before evaluation. So something like the following should work pretty well:

var firstQuery = someDictionary.SelectMany(FirstTransformation).ToArray();
var secondQuery = firstQuery.Select(SecondTransformation);
var thirdQuery = secondQuery.Select(ThirdTransformation).Where(SomeConditionCheck).ToArray();
var finalQuery = thirdQuery.AsParallel().Select(FinalTransformation).Where(x => x != null);
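
With this shape, everything up to the final ToArray stays cheap and sequential, and the only stage PLINQ has to schedule is the expensive FinalTransformation, running over an array whose length it knows up front.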
Joe Strommen

It is nearly impossible to tell without seeing the actual code. But as a general guideline you should consider avoiding P/LINQ for complex number crunching, because the delegate and IEnumerable overhead is just too high. The speed you gain by using threads is very likely eaten up by the convenient abstractions LINQ provides.

Here is some code which calculates the sum of two integer lists, does an int-to-float conversion, and then calculates the cosine of the result. Pretty basic stuff that can be done nicely with LINQ and the .Zip operator, or the old-fashioned way with a for loop.

Update 1: results with the updated parallel LINQ version on my Haswell 8-core machine:

  • Linq 0,95s
  • Linq Parallel 0,19s
  • Optimized 0,45s
  • Optimized Parallel 0,08s

Update 1 end. The original results on my dual-core machine were:

  • LINQ 1,65s
  • Optimized 0,64s
  • Optimized Parallel 0,40s

The time difference is nearly a factor of 3 because of IEnumerable laziness and method-call overhead (I used Release mode, x32, Windows 7, .NET 4, dual core). I tried to use AsParallel in the LINQ version but it actually got slower (2,3s). If you are data driven you should use the Parallel.For construct to get good scalability. IEnumerable in itself is a bad candidate for parallelization since:

  • You do not know how much work you have until you have enumerated to the end.
  • You cannot do eager chunking because you do not know how fast the IEnumerable will return the next element (it could be a web service call or an array index access).

Below is a code sample to illustrate the point. If you want to optimize closer to the bare metal, you first need to get rid of abstractions that cost too much per item. An array access is much, much cheaper than non-inlined MoveNext() and Current calls.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var A = new List<int>(Enumerable.Range(0, 10 * 1000 * 1000));
        var B = new List<int>(Enumerable.Range(0, 10 * 1000 * 1000));

        double[] Actual = UseLinq(A, B);
        double[] pActual = UseLinqParallel(A, B);

        var other = Optimized(A, B);
        var pother = OptimizedParallel(A, B);
    }

    // Plain LINQ: lazy, with delegate and enumerator overhead per element.
    private static double[] UseLinq(List<int> A, List<int> B)
    {
        var sw = Stopwatch.StartNew();
        var Merged = A.Zip(B, (a, b) => a + b);
        var Converted = A.Select(a => (float)a);

        var Result = Merged.Zip(Converted, (m, c) => Math.Cos((double)m / ((double)c + 1)));

        double[] Actual = Result.ToArray();
        sw.Stop();

        Console.WriteLine("Linq {0:F2}s", sw.Elapsed.TotalSeconds);
        return Actual;
    }

    // PLINQ: parallelize the sources first, then chain the operators.
    private static double[] UseLinqParallel(List<int> A, List<int> B)
    {
        var sw = Stopwatch.StartNew();
        var x = A.AsParallel();
        var y = B.AsParallel();

        var Merged = x.Zip(y, (a, b) => a + b);
        var Converted = x.Select(a => (float)a);

        var Result = Merged.Zip(Converted, (m, c) => Math.Cos((double)m / ((double)c + 1)));

        double[] Actual = Result.ToArray();
        sw.Stop();

        Console.WriteLine("Linq Parallel {0:F2}s", sw.Elapsed.TotalSeconds);
        return Actual;
    }

    // Hand-rolled parallel loop: indexed access, no enumerators or delegates
    // in the hot path besides the loop body itself.
    private static double[] OptimizedParallel(List<int> A, List<int> B)
    {
        double[] result = new double[A.Count];
        var sw = Stopwatch.StartNew();
        Parallel.For(0, A.Count, (i) =>
        {
            var sum = A[i] + B[i];
            result[i] = Math.Cos((double)sum / ((double)((float)A[i]) + 1));
        });
        sw.Stop();

        Console.WriteLine("Optimized Parallel {0:F2}s", sw.Elapsed.TotalSeconds);
        return result;
    }

    // Plain for loop: the single-threaded baseline with the least overhead.
    private static double[] Optimized(List<int> A, List<int> B)
    {
        double[] result = new double[A.Count];
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < A.Count; i++)
        {
            var sum = A[i] + B[i];
            result[i] = Math.Cos((double)sum / ((double)((float)A[i]) + 1));
        }
        sw.Stop();

        Console.WriteLine("Optimized {0:F2}s", sw.Elapsed.TotalSeconds);
        return result;
    }
}
Alois Kraus
  • You are starting the parallel LINQ operation a bit too late to take advantage of PLINQ here. If you create two more variables (`var x = A.AsParallel(); var y = B.AsParallel();`) in `UseLinqParallel` and operate on them instead of A and B, you should see a big performance gain (3-4x in my case). Also skip the last `AsParallel()` on the result object. – Jakob Möllås Dec 31 '14 at 09:37
  • I have measured again with your changes. Now parallel LINQ is faster than single-threaded LINQ, but still not by much. – Alois Kraus Dec 31 '14 at 13:10
  • Strange, I get: Linq 0,81s Linq Parallel 0,25s Optimized 0,42s Optimized Parallel 0,15s using `var x = A.AsParallel(); var y = B.AsParallel(); var merged = x.Zip(y, (a, b) => a + b); var converted = x.Select(a => (float)a); var result = merged.Zip(converted, (m, c) => Math.Cos((double)m / ((double)c + 1))); var actual = result.ToArray();` – Jakob Möllås Dec 31 '14 at 16:28
  • Ah, you should remove `ToArray()` on `Merged` & `Converted`. – Jakob Möllås Dec 31 '14 at 16:35