
When I run a simple function in parallel (with different inputs per worker), the parallelized version of the code consistently takes longer than the sequential one.

The simple function is

addprocs(2)
@everywhere function simple(i,Q,vt_0,ct_0)   # note: Q is passed but never used
  a=sum((rand(1:i),vt_0,ct_0))*2
  return a
end

The parallelized version of my code is:

#In Parallel
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;

for i=1:N

  v[i]=fetch(remotecall(simple,2,i,1,vt_0,ct_0))

  c[i]=fetch(remotecall(simple,3,i,2,vt_0,ct_0))

  vt_0=v[i]
  ct_0=c[i]

end
toc()

The sequential code just calls the function in each iteration of the loop (no remotecall() nor fetch()). I suspect that the way I call the workers in parallel in Julia with remotecall() is the root of the problem, but I am not sure.

Can anybody help me figure out why this parallel code is so slow compared to just calling the function? Or is this simple calculation not worth parallelizing?

EDIT:
Here is the Sequential code:

#Sequential
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;

for i=1:N

  v[i]=simple(i,1,vt_0,ct_0)

  c[i]=simple(i,2,vt_0,ct_0)

  vt_0=v[i]
  ct_0=c[i]
end
toc()
  • Are you transferring data between workers and requester for each step? I recall from CUDA, there is an overhead exchanging data between host and worker, and to compensate that, we limit transfers and maximize remote operations – A.Rashad Oct 25 '17 at 20:05
  • I don't think I am transferring data between workers at each step, but as you can see, I change the inputs for the function each step based on the previous step. – Gunnar Oct 25 '17 at 20:20
  • but you transfer data from the requesting agent to the worker, and from the worker back to the requesting agent for each step, correct? – A.Rashad Oct 25 '17 at 20:25
  • Ah I see what you mean. Yes I believe that's what's going on. But then again I'm not sure. Would that be a limitation? – Gunnar Oct 25 '17 at 20:38
  • yes, in parallel processing, you need to bring compute to data not the other way around, because data moving is more expensive than data processing, even on the network. so if you can dispatch larger chunks to workers, this would make things faster – A.Rashad Oct 25 '17 at 20:42
  • What are the actual `toc()-tic()` times for the same code, being run under a `[SERIAL]`-process-scheduling, resp. under the `[CONCURRENT]`-process-scheduling, **but with the `rand(1:i)` replaced with a `10 [s]` sleep?** – user3666197 Oct 25 '17 at 20:52
  • Don't benchmark in the global scope! The code to be benchmarked should be in functions, and ideally use BenchmarkTools https://github.com/JuliaCI/BenchmarkTools.jl if you want a decent estimate. Also see https://docs.julialang.org/en/stable/manual/performance-tips/ – Alexander Morley Oct 26 '17 at 12:33
  • @AlexanderMorley Well, your exclamation mark is not adequate. The O/P problem is way simpler and is solved below, where both memory-allocations and process-scheduling have been excluded from the section-under-test ( so **neither** the `@time` macro, nor the `@btime ...$..()..` syntax-artillery based profiling **are of any use here**, if for nothing else, then due to their pre-condition to serve as decorators on a function declaration edge of a scope-range, which would **add more noise to an already weak signal** ( due to the very-"shallow" amount of computing inside the section-under-test ) ). (cont'd) – user3666197 Oct 26 '17 at 20:04
  • Imagine the direct implication in case your proposal were true -- the whole neuroscience infrastructure, very gratefully funded by the EU -- would have to work as a charm just by such "in-function-profiling". The actual result is the very opposite: the supercomputing infrastructures are among the greatest on the Continent, whereas the overall (co-)-processing performance is not. So all the "in-function-profiling" did not solve the root-cause. **An easily set tic()...toc()-bracket is both trivial + adequate for having a fast quantitative basis for a process-scheduling comparison** – user3666197 Oct 26 '17 at 20:10

1 Answer


Even for the case of the very-"shallow" computing inside simple(), the Julia documentation recommends using remotecall_fetch() rather than fetch( remotecall( ... ) ):

The function remotecall_fetch() exists for this purpose. It is equivalent to fetch(remotecall(...)) but is more efficient.
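
Applied to the loop from the question, this is a drop-in substitution ( everything else kept as-is ), saving one message round-trip per call:

for i=1:N

  v[i]=remotecall_fetch(simple,2,i,1,vt_0,ct_0)
  c[i]=remotecall_fetch(simple,3,i,2,vt_0,ct_0)

  vt_0=v[i]
  ct_0=c[i]

end

This alone will not make the schedule parallel ( the dependencies discussed below still serialise it ), it only trims the per-call overhead.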


The [PARALLEL]-process @spawn / fetch() overheads

This is a very often overlooked topic, and it is great to actively recognise it and to try to reason about it. Such overheads are a common root-cause of an adverse impact on the final [PARALLEL]-process-scheduling performance. If indeed interested in the details, in getting both the mathematical model + an interactive UI-tool to simulate / evaluate the net-effects of the overhead-strict speedup formula on [PARALLEL] code-executions for as many as 2 .. 8000+ CPU-cores, feel free to read more on this here
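
As a sketch of that model's idea: an overhead-strict re-formulation of the classical speedup law explicitly charges the add-on costs ( the symbols below are illustrative, not from the original post: p is the parallelisable fraction, N the worker-count, T1 the pure-[SEQ] runtime, oS(N) / oJ(N) the setup / join add-on overheads ):

S(N) = 1 / ( ( 1 - p ) + p / N + ( oS(N) + oJ(N) ) / T1 )

Whenever ( oS + oJ ) / T1 outweighs the savings from p / N, as with the very-"shallow" payload inside simple(), S(N) drops below 1 and the [PAR]-version runs slower than the [SEQ]-one.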


The main "Suspect" here are the value-inter-process-dependencies:

  • One might be hidden inside the system function rand(). In case a cryptographically strong implementation is used, each, indeed EACH, call to rand() must update the central source-of-randomness'-state. For this particular reason, all, indeed ALL, the spawned processes have to establish and also maintain a queue to this central shared service. Such a queue is not present at all in a simple [SEQ]-code-execution ( where no source-of-randomness'-state updates can ever collide ), but it requires additional hidden overheads ( MUTEX / LOCKS / value-update atomics et al ) in the [PAR]-code-execution units that actually "share" this central resource ( which is hidden to be strictly transactional, with a concurrency-capacity of 1, operating a 1:N-blocking-logic signalling, with no rollback ), as the atomically safe update of the source-of-randomness'-state must be executed and enforced before any next call to the source-of-randomness' service may get served. ( A per-worker RNG sketch that removes this suspect follows right after this list. )

  • The other is the {prior|next}-loop-step dependency, here even mutually crossed between the pair of calls to the simple()-process instances. As illustrated below, this mutual inter-dependence forces all the potentially spawned calls to be arranged as a pure [SEQ]-process-schedule, and the "remaining" sum() is indeed not much meat to chew in a [PARALLEL]-process schedule.
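
If the shared source-of-randomness is suspected, one quick experiment is to give each worker its own, explicitly seeded generator, so no central state remains to be queued for. A minimal sketch ( simple_rng and the myid()-based seeds are illustrative choices, not from the original post ):

# each worker gets a private generator -> no shared source-of-randomness'-state
@everywhere const LOCAL_RNG = MersenneTwister( 1234 + myid() )

@everywhere function simple_rng( i, Q, vt_0, ct_0 )
    a = sum( ( rand( LOCAL_RNG, 1:i ), vt_0, ct_0 ) )*2
    return a
end

# used exactly like simple() above:
# v[i] = remotecall_fetch( simple_rng, 2, i, 1, vt_0, ct_0 )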


AN ILLUSTRATION OF THE ACTUAL MUTUAL INTER-PROCESS DEPENDENCY,
DEPICTING THE IMPOSSIBILITY TO AT LEAST INJECT THE FOR-LOOP
SO AS TO MAKE THE [PARALLEL]-PROCESSING MORE EFFICIENT:

#                            +-------------<---- a way to "inject" a for-loop
#                            |  +----------<---- NOT CONSUMED AT ALL
#                            |  |
#                            |  |             +--******* BUT ********* DEPENDENCY
#                            |  |             |
#                            |  |             v
#                            |  |  +-------<-->- v[]-{prior|next}-step DEPENDENCY
#                            |  |  |     +-<-->- c[]-{prior|next}-step DEPENDENCY
#                            |  |  |     |
#           function simple( i, Q, vt_0, ct_0 )
@everywhere function simple( N, Q, vt_0, ct_0, aVEC )
   for     i  = 1:N
      aVEC[i] = sum( (  rand(1:i), vt_0, ct_0 ) )*2
      #    a  = sum( (  rand(1:i), vt_0, ct_0 ) )*2
      # return a
   end
end
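
Given the recurrence is inherently [SEQ]-only, the best available remedy against the remotecall()-overheads is to amortise them: ship the whole loop to one worker in a single call. A sketch under that assumption ( simple_batch is an illustrative name, not from the original post; it respects the {prior|next}-step dependency, it just avoids paying 2N round-trips ):

@everywhere function simple_batch( N, vt_0, ct_0 )
    v = zeros( N )
    c = zeros( N )
    for i = 1:N
        v[i] = sum( ( rand( 1:i ), vt_0, ct_0 ) )*2
        c[i] = sum( ( rand( 1:i ), vt_0, ct_0 ) )*2
        vt_0 = v[i]                     # the loop-carried dependency stays local
        ct_0 = c[i]
    end
    return v, c
end

v, c = remotecall_fetch( simple_batch, 2, 10000, 0, 0 )   # 1 round-trip instead of 2N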

While adding CSP-channels with explicit inter-process communication via the { put!() | take!() } Channel-methods may solve the need to communicate this dependency, guess what, the coroutines-scheduling will only add further overheads, so expect to pay even more so as to get less.
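
For completeness, such a Channel-based version could look roughly like this ( a sketch only; notice how each step still blocks on take!(), so the schedule remains [SEQ] by construction, now with the extra coroutine- and message-costs on top ):

# 1-slot channels host the cross-step values on the master process
vchan = RemoteChannel( () -> Channel{Float64}( 1 ) )
cchan = RemoteChannel( () -> Channel{Float64}( 1 ) )
put!( vchan, 0.0 )
put!( cchan, 0.0 )

for i = 1:N
    vt_0 = take!( vchan )              # blocks until the prior step has finished
    ct_0 = take!( cchan )
    put!( vchan, remotecall_fetch( simple, 2, i, 1, vt_0, ct_0 ) )
    put!( cchan, remotecall_fetch( simple, 3, i, 2, vt_0, ct_0 ) )
end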


A minor note on raw-profiling:

In all cases, it is advisable to place the tic() .. toc()-bracket tight around the code-section under test, and to exclude any and all memory-allocations and similarly long and noisy parts from the actual measured-execution:

# -------------------------------------------------------------<SuT>-START
tic()
for i=1:N

  v[i]=fetch(remotecall(simple,2,i,1,vt_0,ct_0))
  c[i]=fetch(remotecall(simple,3,i,2,vt_0,ct_0))

  vt_0=v[i]
  ct_0=c[i]

end
toc()
# -------------------------------------------------------------<SuT>-FINISH
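
( On Julia versions >= 0.7, where tic() / toc() were removed, the very same tight bracket can be kept via @elapsed; there, `using Distributed` is also needed for the remotecall-machinery: )

t = @elapsed for i = 1:N
        v[i] = remotecall_fetch( simple, 2, i, 1, vt_0, ct_0 )
        c[i] = remotecall_fetch( simple, 3, i, 2, vt_0, ct_0 )
        vt_0 = v[i]
        ct_0 = c[i]
    end
println( t )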