
I would like to parallelize a for loop in Octave on a single machine (as opposed to a cluster). I asked a question a while ago about a parallel version of Octave: parallel computing in octave

The answer suggested that I download a parallel computing package, which I did. The package seems largely geared toward cluster computing; it does mention single-machine parallel computing, but it was not clear how to run even a simple parallel loop.

I also found another question on SO about this, but I did not find a good answer for parallelizing loops in Octave: Running portions of a loop in parallel with Octave?

Does anyone know where I can find an example of running a for loop in parallel in Octave?

  • If this feature (is a feature?) of Octave isn't well documented, this might not be such a good idea. – Ira Baxter May 09 '12 at 17:27
  • Here, they claim that this is a feature of Octave: http://octave.sourceforge.net/parallel/ But, it does not seem to be well documented. – db1234 May 09 '12 at 17:35
  • Presumably you're trying to use parallelism to improve performance; if that's the case, you _don't_ want to use for loops, in serial *or* in parallel, as for loops in Octave (and IDL, and to a lesser extent, Matlab) are slow slow slow. – Jonathan Dursi May 10 '12 at 13:33
  • @JonathanDursi: I have two comments. One, how can I bypass using a for loop to, say, compute a function f(x,y) on a 2D grid? Also, even if for loops in Octave are slow, they would presumably be faster when parallelized, so it still seems worth knowing whether one can parallelize a for loop in Octave. – db1234 May 10 '12 at 21:44
  • Ok, I'm going to have to post something as an answer, since it won't fit in a comment: – Jonathan Dursi May 10 '12 at 22:16

3 Answers


I am computing a large number of RGB histograms. I need explicit loops to do it, so computing each histogram takes noticeable time. For this reason, running the computations in parallel makes sense. Octave has an (experimental) function parcellfun, written by Jaroslav Hajek, that can be used for this.

My original loop:

  histograms = zeros(size(files,2), bins^3);
  % calculate histogram for each image
  for c = 1 : size(files,2)
    I = imread(fullfile(dir, files{c}));
    h = myhistRGB(I, bins);
    histograms(c, :) = h(:); % change to 1D vector
  end

To use parcellfun, I need to refactor the body of my loop into a separate function.

function histogram = loadhistogramp(file)
  I = imread(fullfile('.', file));
  h = myhistRGB(I, 8);
  histogram = h(:); % change to 1D vector
end

Then I can call it like this:

histograms = parcellfun(8, @loadhistogramp, files);
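
As a stripped-down sketch of the same pattern (the package name differs between versions; parcellfun used to ship with the general package, as in the transcript below, and lives in the parallel package in newer releases):

  pkg load parallel;  % or 'pkg load general' on older installations
  % apply an anonymous function to each cell element, using 2 worker processes;
  % like cellfun, parcellfun collects the scalar results into an array
  squares = parcellfun(2, @(x) x^2, {1, 2, 3, 4});

The first argument is the number of subprocesses to spawn; a good starting point is the number of physical cores.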

I did a small benchmark on my computer, which has 4 physical cores with Intel Hyper-Threading enabled.

My original code:

tic(); histograms2 = loadhistograms('images.txt', 8); toc();
warning: your version of GraphicsMagick limits images to 8 bits per pixel
Elapsed time is 107.515 seconds.

With parcellfun:

octave:1> pkg load general; tic(); histograms = loadhistogramsp('images.txt', 8); toc();
parcellfun: 0/178 jobs done
warning: your version of GraphicsMagick limits images to 8 bits per pixel
[the warning above is repeated once per subprocess, 8 times in total]
parcellfun: 178/178 jobs done
Elapsed time is 29.02 seconds.

The results from the parallel and serial versions were the same (only transposed):

octave:6> sum(sum((histograms'.-histograms2).^2))
ans = 0

When I repeated this several times, the running times were pretty much the same. The parallel version ran in around 30 seconds (± approx. 2 s) with 4, 8, and also 16 subprocesses.

  • Sir, I've read some references about octave parallel package, but I still don't understand why we should put "@" before the function. Can you explain it ? thx – affhendrawan May 13 '17 at 08:51
  • @affhendrawan You are supposed to post a new SO question for this, according to SO rules. I do not remember Octave all that well, I last used it at college. Luckily for you, the question already exists on this site http://stackoverflow.com/questions/25355028/what-is-the-purpose-of-the-symbol-in-front-of-a-variable-in-octave – user7610 May 13 '17 at 09:56

Octave loops are slow, slow, slow and you're far better off expressing things in terms of array-wise operations. Let's take the example of evaluating a simple trig function over a 2d domain, as in this 3d octave graphics example (but with a more realistic number of points for computation, as opposed to plotting):

vectorized.m:

tic()
x = -2:0.01:2;
y = -2:0.01:2;
[xx,yy] = meshgrid(x,y);
z = sin(xx.^2-yy.^2);
toc()

Converting it to for loops gives us forloops.m:

tic()
x = -2:0.01:2;
y = -2:0.01:2;
z = zeros(401,401);
for i=1:401
    for j=1:401
        lx = x(i);
        ly = y(j);
        z(i,j) = sin(lx^2 - ly^2);
    endfor        
endfor
toc()

Note that the vectorized version already "wins" by being simpler and clearer to read, but there's another important advantage, too; the timings are dramatically different:

$ octave --quiet vectorized.m 
Elapsed time is 0.02057 seconds.

$ octave --quiet forloops.m 
Elapsed time is 2.45772 seconds.

So if you were using for loops, and you had perfect parallelism with no overhead, you'd have to break this up onto 119 processors just to break even with the non-for-loop version!

Don't get me wrong, parallelism is great, but first get things working efficiently in serial.

Almost all of octave's built-in functions are already vectorized in the sense that they operate equally well on scalars or entire arrays; so it's often easy to convert things to array operations instead of doing things element-by-element. For those times when it's not so easy, you'll generally see that there are utility functions (like meshgrid, which generates a 2d-grid from the cartesian product of 2 vectors) that already exist to help you.
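
As a further sketch (this is my addition, not part of the original answer): in Octave 3.6 and later, automatic broadcasting lets you skip meshgrid entirely for this example, by combining a row vector with a column vector:

  x = -2:0.01:2;            % 1x401 row vector
  y = (-2:0.01:2)';         % 401x1 column vector
  z = sin(x.^2 - y.^2);     % broadcasting expands both operands to 401x401

This computes the same grid of values as the meshgrid version above, without materializing the intermediate xx and yy matrices.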

  • Wow, that is really surprising...I refined the grid a lot, and the for loop really is incredibly slow (I wanted to check how the running time of the vectorized code scaled with refining the grid)...I guess the best thing to do is maybe write parallel code in C or Fortran for big simulations, or just vectorize Octave/Matlab code, but I think the former can be much faster. – db1234 May 10 '12 at 23:01
  • One more comment: Can Octave (or Matlab for that matter) parallelize something like your vectorized.m program? – db1234 May 10 '12 at 23:04
  • The obvious question: why can't octave (or matlab) **"see"** that the for loop is just doing an array op and do it quickly? Internally, both of these could be simplified to the same machine code instructions. The fault lies in octave, not in the programmer, don't you think? – Sanjay Manohar Dec 27 '15 at 17:02
  • @SanjayManohar It is a tough task for the compiler to determine dependencies and decide on which parts of the code can be parallelized. Using vectorized operations makes things straightforward. Now, functional languages allow for out of the box parallelism but at the price of adapting to a completely different programming paradigm. So, no, the fault does not lie with Octave. – Tarik Mar 17 '16 at 04:52

Usage examples for pararrayfun can now be found on the Octave wiki: http://wiki.octave.org/Parallel_package
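
For reference, a minimal sketch of pararrayfun (assuming the parallel package is installed; pararrayfun is the numeric-array counterpart of parcellfun, just as arrayfun is to cellfun):

  pkg load parallel;
  % square each element of 1:8, distributed over 2 worker processes
  y = pararrayfun(2, @(t) t^2, 1:8);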
