
This is closely related to the following post, which was answered very well by Przemyslaw Szufel:

How can I run a simple parallel array assignment operation in Julia?

Given that I have a 40-core machine, I decided to follow Przemyslaw's advice and go with @distributed, rather than Threads, to perform the array assignment operations. This sped things up quite nicely.

My algorithm's only slight difference with the above user's situation is that I have nested loops. Of course, I could always vectorize the array I'm performing the assignment operation on, but that would complicate my code. Should I simply include @sync @distributed before the outermost loop, and leave it at that? Or would I need to put additional macros before the (two, in my case) inner loops to maximize the benefits of parallelization?
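For concreteness, my situation looks roughly like this (the array name, dimensions, and computation are placeholders, not my real code):

```julia
# Placeholder dimensions; the real array is much larger.
A = zeros(100, 50, 20)

for i in 1:size(A, 1)          # outermost loop
    for j in 1:size(A, 2)      # two inner loops
        for k in 1:size(A, 3)
            # Placeholder for the real, independent per-element computation.
            A[i, j, k] = i + 2j + 3k
        end
    end
end
```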

GBatta

1 Answer


In the case of distributed loops, you normally want to parallelize only the outermost loop. Why? Because distributing the workload takes a significant amount of time.
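A minimal sketch of that advice, using a SharedArray so the workers can write into the same memory (the worker count and dimensions here are placeholders):

```julia
using Distributed
addprocs(4)                      # hypothetical worker count; use up to ~40 on a 40-core machine
@everywhere using SharedArrays   # the workers need the module too

A = SharedArray{Float64}(100, 50)

# Parallelize only the outermost loop; the inner loop runs serially on each worker.
@sync @distributed for i in 1:size(A, 1)
    for j in 1:size(A, 2)
        A[i, j] = i + j          # placeholder for the real assignment
    end
end
```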

However, there are scenarios where you might want to consider different parallelization strategies.

Let us consider a scenario with unbalanced execution time. @distributed takes a naive approach, splitting the loop equally between the workers. Suppose you have a loop such as:

for i in 1:100
    for j in 1:i
        # do some heavy-lifting
    end
end

Putting @distributed before the outer loop will be very inefficient, because all parallel workers will end up waiting for the last chunk, where the largest ranges of j are processed. This is a typical loop where the value of parallelization is going to be almost non-existent. In situations like this there are usually two approaches:

  • lazy approach: parallelize over the inner loop. This works well when i takes values orders of magnitude greater than the number of cores
  • efficient approach: create a proxy variable k in 1:(100*(100+1)/2), distribute over it, and then calculate the corresponding values of i and j
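A sketch of the efficient approach; the closed-form recovery of i and j from k (via the inverse triangular-number formula) is my own addition, not part of the answer:

```julia
using Distributed
addprocs(4)                      # hypothetical worker count
@everywhere using SharedArrays

n = 100
total = n * (n + 1) ÷ 2          # number of (i, j) pairs with 1 ≤ j ≤ i

# Map the flat index k back to the (i, j) pair of the triangular loop.
@everywhere function triangular_ij(k)
    i = ceil(Int, (sqrt(8k + 1) - 1) / 2)   # row that contains index k
    j = k - (i - 1) * i ÷ 2                 # offset within that row
    return i, j
end

result = SharedArray{Float64}(total)
@sync @distributed for k in 1:total          # evenly sized chunks of k now carry evenly sized work
    i, j = triangular_ij(k)
    result[k] = i * j            # placeholder for the heavy-lifting
end
```

Distributing over k gives every worker the same number of (i, j) pairs, so the chunks are balanced even though the original rows are not.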

Finally, if the job times are heavily unbalanced and the approaches above do not work, you need some job-polling mechanism. One way to go is to use asyncmap to spawn remote tasks; another is to use external tools. I usually use some simple bash scripts for that; I published my approach to using bash to parallelize jobs on GitHub: https://github.com/pszufe/KissCluster
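Before reaching for external tooling, pmap is also worth a look: unlike @distributed, it schedules dynamically, handing the next iteration to whichever worker finishes first. A sketch (the workload function is a made-up placeholder):

```julia
using Distributed
addprocs(4)                      # hypothetical worker count

@everywhere function heavy_lifting(i)
    s = 0.0
    for j in 1:i                 # cost grows with i, so iterations are unbalanced
        s += sin(i) * cos(j)
    end
    return s
end

# pmap sends one i at a time to each free worker, so the long iterations
# no longer end up bunched together in a single statically assigned chunk.
results = pmap(heavy_lifting, 1:100)
```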

Przemyslaw Szufel
  • This is so helpful, thank you Przemyslaw. Just a very clear explanation. I think the inner loop's nodes take about the same amount of time, but in case that's not true, I will try the efficient approach, or load balance polling, as you suggested. If you have any good references or books on parallel processing (not necessarily Julia-specific) to share, I'd love to hear about them! – GBatta Sep 20 '20 at 21:47
  • You could try to search for videos from JuliaCon by Matt Bauman; they are good tutorials. I also give Julia distributed computing workshops at different places from time to time. It would also be a good idea to read all posts on SO having tags `julia` and `parallel-processing` or `julia` and `distributed`. I remember writing here several useful and production-ready recipes that you can use for your workloads. – Przemyslaw Szufel Sep 20 '20 at 22:37
  • Thank you very much for the recommendations. I will check them out, and I'll look out for your workshops, also! – GBatta Sep 21 '20 at 16:55