How to implement parallel-for in a 4 level nested for loop block

Question

I have to calculate the std and mean of a large data set with respect to quite a few models. The final loop block is nested to four levels.

This is what it looks like:

count = 1;
alpha  = 0.5;
%%%Below if each individual block is to be posterior'd and then average taken 
c = 1;
for i = 1:numel(writers) %no. of writers
    for j = 1: numel(test_feats{i}) %no. of images
        for k = 1: numel(gmm) %no. of models
            for n = 1: size(test_feats{i}{j},1)
                [~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
                c = c + 1;
            end
            c = 1;
            index_kek=find(abs(scores-mean(scores))>alpha*std(scores));
            avg = mean(scores(index_kek)); %using std instead of mean... because of ..reasons
            NLL(count) = avg;
            count = count + 1;
        end
        count = 1; %reset count
        NLL_scores{i}(j,:) = NLL; 

    end
    fprintf('***score for model_%d done***\n', i)
end

It works and gives the desired result, but it takes 3 days to finish the calculation, even on my i7 processor. During processing, the task manager shows that only 20% of the CPU is being used, so I would like to put more load on the CPU to get the result faster.

Going by the official help, if I want to make the outermost loop a parfor while keeping the rest as normal for loops, all I have to do is use integer limits rather than function calls such as size or numel.

So making these changes the above code will become:

count = 1;
alpha  = 0.5;
%%%Below if each individual block is to be posterior'd and then average taken 
c = 1;
num_writers = numel(writers);
num_images = numel(test_feats{1});
num_models = numel(gmm);
num_feats = size(test_feats{1}{1},1);

parfor i = 1:num_writers %no. of writers
    for j = 1: num_images %no. of images
        for k = 1: num_models %no. of models
            for n = 1: num_feats
                [~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
                c = c + 1;
            end
            c = 1;
            index_kek=find(abs(scores-mean(scores))>alpha*std(scores));
            avg = mean(scores(index_kek)); %using std instead of mean... because of ..reasons
            NLL(count) = avg;
            count = count + 1;
        end
        count = 1; %reset count
        NLL_scores{i}(j,:) = NLL; 

    end
    fprintf('***score for model_%d done***\n', i)
end

Is this the optimal way to implement parfor in my case? Can it be improved or optimized further?

Pretty sure a 4-level nested loop is very far from an optimal implementation of anything. The "best" way is difficult to choose, as it depends on the amount of information that needs to be sent to each worker, in addition to the code. It also looks like both your code and your data are large, so practically no one will be able to answer this accurately. Good luck though! - Ander Biguri, Sep 28 '15 at 12:21
The change to explicitly declaring your `numel` before each loop does not change anything; do not use `i` and `j` as [variables](http://stackoverflow.com/questions/14790740/using-i-and-j-as-variables-in-matlab); I'm pretty sure a few of those loops can be avoided through vectorisation, but without your data I cannot check that. Just as a note: `parfor` is not magic. It's best to first fully optimise your serial code (thus vectorise it) and then go parallel. - Adriaan, Sep 28 '15 at 12:33
You can [reduce your nested loops to one for loop](http://stackoverflow.com/questions/20295579/how-to-nest-multiple-parfor-loops/20295693). - Daniel, Sep 28 '15 at 13:49
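As a rough illustration of the vectorisation suggested in the comments, the innermost loop over feature rows might be replaced by a single call. This is only a sketch under an assumption: that the second output of `posterior` for a single row equals `-log(pdf(gm, x))` for that row. The mixture `gm` and the data `X` below are made up for the example, and `pdf` may underflow to zero (giving `Inf`) for high-dimensional features, so the result should be checked against the loop on real data.

```matlab
% Hypothetical two-component mixture and data, for illustration only.
rng(0);
mu = [0 0; 3 3];
sigma = cat(3, eye(2), 2 * eye(2));
gm = gmdistribution(mu, sigma, [0.4 0.6]);
X = [randn(50, 2); 3 + randn(50, 2)];

% Row by row, as in the question's innermost loop.
scores_loop = zeros(size(X, 1), 1);
for n = 1:size(X, 1)
    [~, scores_loop(n)] = posterior(gm, X(n, :));
end

% One call for all rows: pdf returns the mixture density of each row,
% so its negative log is assumed to match the per-row value above.
scores_vec = -log(pdf(gm, X));

% Largest disagreement between the two approaches.
max_diff = max(abs(scores_loop - scores_vec));
```

If `max_diff` is negligible on the real features, the line `scores = -log(pdf(gmm{k}, test_feats{i}{j}));` could stand in for the whole `n` loop before any `parfor` is introduced.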

score 0 · Answer 1 · answered Sep 28 '15 at 13:48

I couldn't test this in MATLAB, but it should be close to a working solution. It reduces the number of loops and changes a few implementation details, but overall it might run only about as fast as your earlier code, or even slower.

If gmm and test_feats take lots of memory, then it is important that parfor is able to determine which pieces of data need to be delivered to which workers. The IDE should warn you if inefficient memory access is detected. This modification is especially useful if num_writers is much less than the number of cores in your CPU, or if it is only slightly larger (for example, 5 writers on 4 cores would take about as long as 8 writers).
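The load-balancing point can be checked with a little arithmetic. The sketch below uses a deliberately simplified model in which every task costs the same and the workers proceed in lockstep rounds; real `parfor` scheduling hands out work in chunks, so the numbers are only indicative. The image and model counts are hypothetical.

```matlab
% Simplified model: equal-cost tasks, workers advance in lockstep rounds.
num_workers = 4;
rounds = @(n_tasks) ceil(n_tasks / num_workers);

% Parallelising over writers only: 5 and 8 writers need the same number of rounds.
rounds_5_writers = rounds(5);
rounds_8_writers = rounds(8);

% Hypothetical sizes for the flattened (writer, image, model) loop.
num_writers = 5; num_images = 10; num_models = 3;
n_combined = num_writers * num_images * num_models;

% Fraction of available worker slots that actually do work.
utilisation_outer = num_writers / (rounds(num_writers) * num_workers);
utilisation_flat  = n_combined / (rounds(n_combined) * num_workers);
```

With these made-up sizes, the writer-only loop leaves a noticeable share of worker slots idle, while the flattened loop keeps almost all of them busy.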

[i_writer, i_image, i_model] = ndgrid(1:num_writers, 1:num_images, 1:num_models);
idx_combined = [i_writer(:) i_image(:) i_model(:)];
n_combined = size(idx_combined, 1);

NLL_scores = zeros(n_combined, 1);

parfor i_for = 1:n_combined
    i = idx_combined(i_for, 1);
    j = idx_combined(i_for, 2);
    k = idx_combined(i_for, 3);

    % pre-allocate
    scores = zeros(num_feats, 1);

    for i_feat = 1:num_feats
        [~, scores(i_feat)] = posterior(gmm{k}, test_feats{i}{j}(i_feat,:));
    end

    % "find" is redundant here and performs a bit slower, might be insignificant though
    index_kek = abs(scores - mean(scores)) > alpha * std(scores);
    NLL_scores(i_for) = mean(scores(index_kek));
end
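Note that the flattened loop leaves `NLL_scores` as one long vector, whereas the question stores one images-by-models matrix per writer. Because `ndgrid` followed by `(:)` enumerates the index triples in column-major order, a `reshape` should recover that layout. The sketch below uses small hypothetical sizes and a stand-in value that encodes each (writer, image, model) triple, so the mapping can be checked by eye; the names `flat` and `cube` are introduced here only for illustration.

```matlab
% Hypothetical sizes, for illustration only.
num_writers = 2; num_images = 3; num_models = 4;

[i_writer, i_image, i_model] = ndgrid(1:num_writers, 1:num_images, 1:num_models);
idx_combined = [i_writer(:), i_image(:), i_model(:)];
n_combined = size(idx_combined, 1);

% Stand-in for the parfor result: each entry encodes its own index triple.
flat = zeros(n_combined, 1);
for i_for = 1:n_combined
    flat(i_for) = 100 * idx_combined(i_for, 1) ...
                + 10 * idx_combined(i_for, 2) ...
                + idx_combined(i_for, 3);
end

% Fold the vector back into a writers x images x models array.
cube = reshape(flat, num_writers, num_images, num_models);

% Rebuild the per-writer cell layout used in the question.
NLL_scores = cell(num_writers, 1);
for w = 1:num_writers
    NLL_scores{w} = reshape(cube(w, :, :), num_images, num_models);
end
```

Here `cube(2, 3, 4)` holds the value for writer 2, image 3, model 4, and `NLL_scores{w}(j, k)` matches the indexing of the serial version.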
