I ran this code on a computer with 44 workers. Each iteration takes longer in parallel than in serial mode, although the total execution time of the loop as a whole goes down.

  % Build 31 square templates of increasing size (25x25 up to 55x55)
  template=cell(31,1);
  for i=1:31
     template{i}=rand(i+24);
  end
  parfor i=1:5
     img=rand(800,1280+i); % In the real code another function supplies img; this is just an example
     tic
     cellfun(@(t) normxcorr2(t,img),template,'UniformOutput',false);
     toc
  end

As a result, the elapsed time for each iteration is approximately 18 s. When I change the parfor to for, each iteration takes approximately 6.7 s.

Can you explain why the parfor loop is slower than the for loop in this case? I checked the MATLAB documentation and similar questions, but they didn't help.

Note: the total execution time of the script is shorter for the parfor version; I just want to understand why the cellfun call is about 3 times slower in the parallel version.

Adriaan
ransa
  • I imagine your CPU cores are much faster than your GPU cores. I'm assuming that if you have 44 workers you must be using MATLAB on a GPU, as you're unlikely to have a computer with 44 CPU cores. – Garr Godfrey Apr 21 '17 at 09:28
  • @GarrGodfrey IIRC, `parfor` uses CPU cores by default, not the GPU cores. – Paul Brodersen Apr 21 '17 at 09:33
  • @ransa One potential explanation is that before running the computation within the parfor loop, the data needs to be partitioned and then sent to the individual workers. Presumably that overhead is larger than the efficiency increase. Hard to say without seeing the data and the function. – Paul Brodersen Apr 21 '17 at 09:35
  • You might want to read up on [this question](http://stackoverflow.com/q/32095552/5211833) on broadcasting data to workers and on [this one](http://stackoverflow.com/q/32146555/5211833) for time and memory savings using `parfor` (cc @Paul) – Adriaan Apr 21 '17 at 09:41
  • By your `for` version you mean a simple substitution of `parfor` with `for` I take it? Additionally, having 44 workers is not really relevant when you have only 5 instances in your loop. Either it will use just 5 workers (which have to be physical CPU cores, so I hope you're on a server), or it will use all 44 regardless whether you're using `for` or `parfor` if the functions have implicit multithreading (see my previous comment, second link for that). If the latter is the case, it's simply slower because variables need to be broadcast to all workers, which takes more time than the computation – Adriaan Apr 21 '17 at 09:44
  • @Adriaan Yes, I just substituted `parfor` with `for`. I read that fft2 / ifft2 in normxcorr2 use multiple cores to be fast; that's why I think it's slower. – ransa Apr 21 '17 at 12:24

1 Answer

Check CPU usage.

I believe the main reason here is that functions like fft (most likely part of normxcorr2) already use more than a single core. I can't test parfor right now, but with your code an ordinary for loop already reaches about 70% CPU utilization on my 4C/4T CPU. So parfor can at best fill the remaining 30% (on my computer), and each instance will obviously then run slower.
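One way to check this on your own machine is to time the same FFT work with the default number of computational threads and then with a single thread, which is roughly what each parfor worker gets by default. This is only a minimal sketch: the `ifft2(fft2(img))` round trip is an assumed stand-in for the FFT work inside normxcorr2 (so no toolbox is needed), and the repeat count of 20 is arbitrary.

```matlab
% Sketch: compare multithreaded vs single-threaded timing of FFT-heavy work.
img = rand(800, 1280);            % stand-in image, same size as in the question
nRep = 20;                        % arbitrary repeat count (assumption)

nDefault = maxNumCompThreads;     % current number of computational threads
tic
for k = 1:nRep
    F = ifft2(fft2(img));         % assumed stand-in for the FFT work in normxcorr2
end
tMulti = toc;

nPrev = maxNumCompThreads(1);     % restrict to one thread; returns the previous setting
tic
for k = 1:nRep
    F = ifft2(fft2(img));
end
tSingle = toc;
maxNumCompThreads(nPrev);         % restore the original thread count

fprintf('%d thread(s): %.3f s, 1 thread: %.3f s\n', nDefault, tMulti, tSingle);
```

If the single-thread time is noticeably larger, that gap is a plausible source of the per-iteration slowdown seen inside parfor.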

Zizy Archer
  • You're right, fft2 / ifft2 use multiple cores, and thus it's not efficient. So my question is how to overcome this: is there any other way to calculate correlations without using functions that use multiple cores? Another question: I thought that if two functions each use more than a single core, priority goes to the outer parallel function, so the inner one runs serially. Why is that not the case here? – ransa Apr 21 '17 at 12:18
  • @ransa What do you mean, not efficient? Your code has to do X calculations, and if fft uses 4 cores, it will take 1/4 of the time it would with a single core. This is good, not bad :) There is no point in avoiding fft or making it run on a single core. If you do, you will make the code run 4x slower, and then parfor will make it run 4x faster again, so you will be exactly where you started. And fft actually should run on a single core when you use parfor (assuming you have as many threads in use as there are cores). So this slowdown per cycle is a direct consequence of fft running on 1 core instead of 4. – Zizy Archer Apr 21 '17 at 13:50
  • @ransa If you want efficiency as in making `parfor` cases run at the speed of `for` for single case (and use only the leftover CPU cycles on other threads), well, I have no idea how to do that. I don't believe you even can - it would require a lot of work from Mathworks and likely wouldn't even work properly most of the time. – Zizy Archer Apr 21 '17 at 13:53
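The trade-off described in these comments can be made concrete with the timings from the question (about 6.7 s per iteration with for, about 18 s with parfor, 5 iterations). This is back-of-the-envelope arithmetic only, assuming all 5 parfor iterations run concurrently on separate workers and ignoring pool startup and data transfer overhead.

```matlab
% Sketch: total wall-clock time from the per-iteration timings in the question.
nIter       = 5;      % loop iterations
tSerialIter = 6.7;    % seconds per iteration with for (from the question)
tParIter    = 18;     % seconds per iteration with parfor (from the question)

totalSerial = nIter * tSerialIter;   % iterations run one after another
totalParfor = tParIter;              % assumption: all iterations overlap fully
speedup     = totalSerial / totalParfor;

fprintf('for: %.1f s, parfor: %.1f s, speedup: %.2fx\n', ...
        totalSerial, totalParfor, speedup);
```

Under these assumptions each iteration is slower in parfor, yet the loop as a whole still finishes sooner, which matches what the question reports.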