
I’m working on a genetic machine learning project developed in .NET (as opposed to Matlab, my norm). I’m no pro .NET coder, so excuse any noobish implementations.

The project itself is huge so I won’t bore you with the full details, but basically each network in a population of artificial neural networks (conceptually similar to decision trees) is evaluated on a problem domain that, in this case, uses a stream of sensory inputs. The top performers in the population are allowed to breed and produce offspring (which inherit tendencies from both parents), while the poor performers are killed off or bred out of the population. Evolution continues until an acceptable solution is found. Once found, the final evolved network is extracted from the lab and placed in a lightweight real-world application. The technique can be used to develop very complex control solutions that would be almost impossible or too time-consuming to program normally: automated car driving, mechanical stability control, datacentre load balancing, etc.

Anyway, the project has been a huge success so far and is producing amazing results, but the one problem is very slow performance once I move to larger datasets. I’m hoping it’s just my code, so I would really appreciate some expert help.

In this project, convergence to a solution close to the ideal can often take around seven days of processing! Making a little tweak to a parameter and then waiting for results is just too painful.

Basically, multiple parallel threads need to read sequential sections of a very large dataset (the data does not change once loaded). The dataset consists of around 300 to 1,000 doubles per row and upwards of 500k rows. As the dataset can exceed the .NET single-object limit of 2 GB, it can’t be stored in a normal 2D array; the simplest way around this was a generic List of one-dimensional arrays.

The parallel scalability seems to be a big limiting factor: running the code on a beast of a server with 32 Xeon cores, which normally eats big datasets for breakfast, does not yield much of a performance gain over a Core i3 desktop!

Performance gains quickly dwindle away as the number of cores increases.

From profiling the code (with my limited knowledge), I get the impression that there is a huge amount of contention when reading the dataset from multiple threads.

I’ve tried experimenting with different dataset implementations using jagged arrays and various concurrent collections, but to no avail.

I’ve knocked up a quick and dirty bit of benchmarking code that is similar to the core implementation of the original and exhibits the same read-performance and parallel-scalability issues.

Any thoughts or suggestions would be much appreciated, or confirmation that this is the best I’m going to get.

Many thanks

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

//Benchmark script to time how long it takes to read the dataset per iteration

namespace Benchmark_Simple
{
class Program
{
    public static TrainingDataSet _DataSet;
    public static int Features = 100; //Real test will require 300+
    public static int Rows = 200000; //Real test will require 500K+
    public static int _PopulationSize = 500; //Real test will require 1000+
    public static int _Iterations = 10;
    public static List<NeuralNetwork> _NeuralNetworkPopulation = new List<NeuralNetwork>();

    static void Main()
    {
        Stopwatch _Stopwatch = new Stopwatch();

        //Create Dataset
        Console.WriteLine("Creating Training DataSet");
        _DataSet = new TrainingDataSet(Features, Rows);
        Console.WriteLine("Finished Creating Training DataSet");

        //Create Neural Network Population
        for (int i = 0; i < _PopulationSize; i++)
        {
            _NeuralNetworkPopulation.Add(new NeuralNetwork());
        }

        //Main Loop
        for (int i = 0; i < _Iterations; i++)
        {
            _Stopwatch.Restart();

            Parallel.ForEach(_NeuralNetworkPopulation, _Network => { EvaluateNetwork(_Network); });

            //######## Removed for simplicity ##########

            //Run the evolutionary genetic algorithm on the population - i.e. breed the strong, kill off the weak

            //##########################################
            //Repeat until an acceptable solution is found

            Console.WriteLine("Iteration time: {0}", _Stopwatch.ElapsedMilliseconds / 1000);

            _Stopwatch.Stop();

        }

        Console.ReadLine();

    }

    private static void EvaluateNetwork(NeuralNetwork Network)
    {
        //Evaluate network on 10% of the Training Data at a random starting point

        double Score = 0;

        //Note: a time-seeded Random created per call means threads starting on
        //the same tick can produce identical sequences; acceptable for dummy work
        Random Rand = new Random();

        int Count = Rows / 10; //Evaluate on 10% of the training data

        int RandomStart = Rand.Next(0, Rows - Count);

        //The data must be read sequentially
        for (int i = RandomStart; i < RandomStart + Count; i++)
        {
            double[] NetworkInputArray = _DataSet.GetDataRow(i);

            //####### Dummy evaluation - just give it something to do for the sake of it
            double[] Temp = new double[NetworkInputArray.Length + 1];
            for (int j = 0; j < NetworkInputArray.Length; j++)
            {
                Temp[j] = Math.Log(NetworkInputArray[j] * Rand.NextDouble());
            }
            Score += Rand.NextDouble();
            //##################
        }
        Network.Score = Score;
    }

    public class TrainingDataSet
    {
        //Simple demo class of fake data for benchmarking

        private List<double[]> DataList = new List<double[]>();

        public TrainingDataSet(int Features, int Rows)
        {
            Random Rand = new Random();

            for (int i = 0; i < Rows; i++)
            {
                double[] NewRow = new double[Features];
                for (int j = 0; j < Features; j++)
                {
                    NewRow[j] = Rand.NextDouble();
                }
                DataList.Add(NewRow);
            }

        }

        public double[] GetDataRow(int Index)
        {
            return DataList[Index];
        }

    }

    public class NeuralNetwork
    {
        //Simple class representing a dummy neural network
        public double Score { get; set; }
    }
}
}
  • One thing I noticed is you are using Parallel.ForEach, which will block. Did you look into ThreadPool.QueueUserWorkItem and PLINQ? [see the PLINQ sketch below] – Learner Jul 24 '13 at 15:42
  • In my experience Parallel.ForEach over the entire list has high contention. I get better performance by dividing totalItems by the number of procs, then: `var groupSize = totalItems/numProcessors; Parallel.For(0, numProcessors, (groupId) => { var start = groupId*groupSize; var end = Math.Min(start+groupSize, totalItems); for(int i = start; i < end; i++) { /*do work*/ } });` [a cleaned-up version follows below] – Handcraftsman Jul 24 '13 at 15:50
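
For reference, a minimal PLINQ equivalent of the Parallel.ForEach call, along the lines Learner suggests (a sketch, not from the original thread; requires using System.Linq):

//PLINQ variant of the evaluation loop. ForAll runs the delegate in
//parallel without preserving order, which is fine here because each
//network's score is independent of the others.
_NeuralNetworkPopulation.AsParallel().ForAll(network => EvaluateNetwork(network));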
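
And a compilable reading of Handcraftsman's partitioning suggestion (the ceiling division and variable names are my interpretation):

//One contiguous block of networks per processor; each Parallel.For task
//then iterates its block sequentially, cutting per-item scheduling overhead
int numProcessors = Environment.ProcessorCount;
int totalItems = _NeuralNetworkPopulation.Count;
int groupSize = (totalItems + numProcessors - 1) / numProcessors; //ceiling division
Parallel.For(0, numProcessors, groupId =>
{
    int start = groupId * groupSize;
    int end = Math.Min(start + groupSize, totalItems);
    for (int i = start; i < end; i++)
    {
        EvaluateNetwork(_NeuralNetworkPopulation[i]);
    }
});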

1 Answer


The first thing to say is that the only way to answer any performance question is by profiling the application. I'm using the VS 2012 built-in profiler; there are others: https://stackoverflow.com/a/100490/19624

From an initial read-through of the code, i.e. a static analysis, the only thing that jumped out at me was the continual reallocation of Temp inside the loop; this is not efficient and, if possible, needs moving outside of the loop.

With a profiler you can see what's happening:

[Profiler summary screenshot]

I profiled first using the code you posted (top marks to you for posting a full compilable example of the problem; if you hadn't, I wouldn't be answering this now).

This shows me that the bulk of the time is spent inside the inner loop, so I moved the allocation out to the Parallel.ForEach lambda:

Parallel.ForEach(_NeuralNetworkPopulation, _Network => 
{
    double[] Temp = new double[Features + 1];
    EvaluateNetwork(_Network, Temp); 
});
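
The call site above implies that EvaluateNetwork now takes the scratch buffer as a parameter; a minimal sketch of the reworked method (the body is a reconstruction, reusing Temp across rows rather than reallocating it):

private static void EvaluateNetwork(NeuralNetwork Network, double[] Temp)
{
    double Score = 0;
    Random Rand = new Random();

    int Count = Rows / 10; //Evaluate on 10% of the training data
    int RandomStart = Rand.Next(0, Rows - Count);

    //The data must be read sequentially; Temp is reused for every row
    for (int i = RandomStart; i < RandomStart + Count; i++)
    {
        double[] NetworkInputArray = _DataSet.GetDataRow(i);
        for (int j = 0; j < NetworkInputArray.Length; j++)
        {
            Temp[j] = Math.Log(NetworkInputArray[j] * Rand.NextDouble());
        }
        Score += Rand.NextDouble();
    }
    Network.Score = Score;
}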

[Profiler results: original code]

So what I can see from the above is that there is 4.4% wastage on the reallocation, but the probably unsurprising thing is that the inner loop is taking 87.6%.

This takes me to my first rule of optimisation, which is to first review your algorithm rather than optimising the code. A poor implementation of a good algorithm is usually faster than a highly optimised poor algorithm.

Removing the repeated allocation of Temp changes the picture slightly:

[Profiler results after moving the allocation to the outer loop]

It's also worth tuning a bit by specifying the degree of parallelism; I've found that Parallel.ForEach is good enough for what I use it for, but again you may get better results from manually partitioning the work up into queues:

Parallel.ForEach(_NeuralNetworkPopulation,
                 new ParallelOptions { MaxDegreeOfParallelism = 32 },
                 _Network =>
{
    double[] Temp = new double[Features + 1];
    EvaluateNetwork(_Network, Temp);
});
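
If you do go down the manual-partitioning route, the TPL's built-in range partitioner saves hand-rolling the index arithmetic; a sketch using Partitioner.Create (from System.Collections.Concurrent):

//Each worker receives a contiguous (fromInclusive, toExclusive) range, so
//the per-item delegate overhead of Parallel.ForEach is paid once per chunk
//rather than once per network
Parallel.ForEach(Partitioner.Create(0, _NeuralNetworkPopulation.Count),
                 range =>
{
    double[] Temp = new double[Features + 1];
    for (int i = range.Item1; i < range.Item2; i++)
    {
        EvaluateNetwork(_NeuralNetworkPopulation[i], Temp);
    }
});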

Whilst running I'm getting what I'd expect in terms of CPU usage, although my machine was also running another lengthy process which accounts for the base level (the peak in the chart below is from profiling this program).

[CPU usage chart]

So to summarize:

  1. Review the most frequently executed part and come up with a new algorithm if possible.
  2. Profile on the target machine.
  3. Only when you're sure about (1) is it worth optimising the implementation, considering: a) code optimisations; b) memory tuning / partitioning of data to keep as much as possible in cache (see the sketch below); c) improvements to threading usage.
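
As an illustration of (3b): one approach is to pack many rows into each contiguous block (each block well under the 2 GB object limit) so sequential reads walk linear memory instead of chasing one small heap object per row. This is my own sketch, not something benchmarked here, and the ChunkedDataSet name is hypothetical:

using System;
using System.Collections.Generic;

public class ChunkedDataSet
{
    private readonly List<double[]> _chunks = new List<double[]>();
    private readonly int _features;
    private readonly int _rowsPerChunk;

    public ChunkedDataSet(int features, int rows, int rowsPerChunk)
    {
        _features = features;
        _rowsPerChunk = rowsPerChunk;
        Random rand = new Random();
        for (int start = 0; start < rows; start += rowsPerChunk)
        {
            //Last chunk may hold fewer rows than _rowsPerChunk
            int chunkRows = Math.Min(rowsPerChunk, rows - start);
            double[] chunk = new double[chunkRows * features];
            for (int k = 0; k < chunk.Length; k++)
            {
                chunk[k] = rand.NextDouble();
            }
            _chunks.Add(chunk);
        }
    }

    //Copy a row into a caller-supplied buffer, avoiding per-call allocation
    public void CopyRow(int index, double[] buffer)
    {
        double[] chunk = _chunks[index / _rowsPerChunk];
        Array.Copy(chunk, (index % _rowsPerChunk) * _features, buffer, 0, _features);
    }
}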
Richard Harrison
  • Many thanks for your help in profiling the code and your sound advice. I’m going to try some different core algorithm implementations first, rather than trying in vain to polish what I’ve got, and use the profiler a lot more as I go. Once I’ve settled on a solid foundation, I’ll then look to code optimisations. Thanks again for your very informative and thorough response – bag of doughnuts on its way to you, sir! – user2615145 Jul 25 '13 at 10:09