
I have a trivially parallelizable problem I'm trying to solve in MATLAB on a machine with 40 cores and plenty of memory. Beyond about 10 cores I see no decrease in computation time; sometimes the computation time even increases.

While investigating this, I created a simple benchmarking code that seems to illustrate the problem:

clc
clear
p = gcp;                    % start (or reuse) the parallel pool
poolSize = p.NumWorkers;
M = 60;                     % iterations per worker
N = 500;

% Cheap kernel: one sin() per matrix element
t1 = zeros(1, poolSize);
for n = 1:poolSize
    tic;
    parfor (ImageInd = 1:n*M, n)
        vec1 = rand(1, N);
        vec2 = rand(N, 1);
        d = sin(vec2*vec1);   % N-by-N outer product, then elementwise sin
    end
    t1(n) = toc;
end
figure; plot(1:poolSize, (1:poolSize)*M./t1, 'b')        % measured throughput
hold on; plot(1:poolSize, (1:poolSize)*M./t1(1), 'b--')  % ideal linear scaling

% Expensive kernel: identical memory traffic, but the nested sin()
% roughly doubles the arithmetic per element
t2 = zeros(1, poolSize);
for n = 1:poolSize
    tic;
    parfor (ImageInd = 1:n*M, n)
        vec1 = rand(1, N);
        vec2 = rand(N, 1);
        d = sin(sin(vec2*vec1));
    end
    t2(n) = toc;
end
figure; plot(1:poolSize, (1:poolSize)*M./t2, 'g')       % measured throughput
hold on
plot(1:poolSize, (1:poolSize)*M./t2(1), 'g--')          % ideal linear scaling

The two parfor loops in the script above involve the same amount of memory traffic but differ in the amount of computation per iteration (the nested sine call should roughly double the arithmetic per element, if I'm not mistaken). The second, more computationally expensive loop scales very well with more processors, while the first does not. This behavior is consistent whether I use temporary variables, as in the example, reduction variables, or sliced variables.

Am I correct in assuming that this is a memory issue? Could the problem be fixed by switching to another programming language or a better computer architecture? The variables involved should be smaller than the processors' caches; does MATLAB not utilize the cache well during parallel processing?

Brick
  • If you monitor your memory usage (e.g. using Windows' Task Manager) do you see this problem happening? Just as a note: you are using `for` loops here, not `parfor`; did you use actual `parfor` loops in the test you ran? Also, see [this answer](http://stackoverflow.com/questions/32146555/saving-time-and-memory-using-parfor-in-matlab/32146700#32146700) – Adriaan Sep 15 '15 at 21:16

1 Answer


There's generally some overhead in parallel execution related to moving memory around, starting threads, etc. Eventually this overhead can swamp the benefit of the parallel execution. I suspect this is what's happening in your case, especially since, as you note, the more computationally intensive version scales better.

I don't know all of the details of parfor's implementation inside MATLAB, but I'm under the impression that it does not handle memory efficiently. In particular, I believe it copies data to the workers under the covers rather than accessing it "in place." You might do better in a language that lets you exploit a shared-memory model through multi-threading, especially if you can employ thread pools that are started just once at the beginning of the program.

This article and its sequels have some related information, both about MATLAB internals and about workarounds using other languages: http://undocumentedmatlab.com/blog/explicit-multi-threading-in-matlab-part1

Brick