I'm using MATLAB's parallel toolbox on a shiny server (32 physical cores, 64 logical, 512GB RAM). All my parallellization needs can be classified as 'embarrassingly parallel', e.g.:
parfor i_par = 1:n_tot
[CalcOutputs_array(:,:,i_par), AdvancedCalcOutputs_array(i_par)] = CPU_DemandingFunction(InputsStruct_array(i_par));
end
In order to minimize overheads, I've already split the input such that each worker receives only what it needs.
The output, however, can become quite large (1GB-10GB), and data collection from the workers is vastly over-weighting the calculation itself.
I'm positive that this is the case since by lowering the amount of data outputted from the function (I have a flag that controls it) - the runtime improvement factor approaches the # of cores i'm using.
Can that be improved? Is there any remedy for data-collection overheads?