Even for the case of very-"shallow"-computing inside simple(), the Julia documentation recommends using remotecall_fetch() rather than fetch( remotecall( ... ) ):
"The function remotecall_fetch() exists for this purpose. It is equivalent to fetch(remotecall(...)) but is more efficient."
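For illustration, the two forms below return the same value ( shown here with the same arguments as used in the measured loop further below ), yet the fused call avoids one extra round of Future-handling overhead:

r = fetch( remotecall( simple, 2, i, 1, vt_0, ct_0 ) )   # two steps
r = remotecall_fetch(  simple, 2, i, 1, vt_0, ct_0 )     # one fused, cheaper step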
The [PARALLEL]-process @spawn / fetch() overheads
This is a very often overlooked topic, so it is great to actively recognise it and to try to reason about it, as it is a common root-cause of adverse impacts on the final performance of a [PARALLEL]-process schedule. If indeed interested in the details, both a mathematical model and an interactive UI-tool to simulate / evaluate the net effects of the overhead-strict speedup formula on [PARALLEL] code-executions for anywhere from 2 to 8000+ CPU-cores are available; feel free to read more on this here.
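As a minimal sketch of what such an overhead-strict formulation adds on top of the classical Amdahl's law speedup ( the symbol names and the exact placement of the overhead terms below are illustrative assumptions, not the tool's definitive form ):

# overhead-strict re-formulation of the Amdahl's law speedup ( a sketch ):
#
#   S( N ) = 1 / ( ( 1 - p ) + ( oSETUP + oTERM ) + p / N )
#
#   p ....... the [PARALLEL]-izable fraction of the original [SEQ]-code
#   N ....... the number of processing units actually used
#   oSETUP .. per-run add-on overhead to spawn / setup the [PAR]-work-units
#   oTERM ... per-run add-on overhead to collect the results ( fetch() et al )
speedup( p, N, oSETUP, oTERM ) = 1 / ( ( 1 - p ) + ( oSETUP + oTERM ) + p / N )

speedup( 0.95, 8, 0.01, 0.01 )   # ~5.3x, the overheads already eat into the ~5.9x classical figure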
The main "Suspect" here are the value-inter-process-dependencies:
One might be hidden inside the system function rand(). In case a cryptographically strong implementation is used, each, indeed EACH, call to rand() must update the central source-of-randomness' state. That means that, for this particular reason, all, indeed ALL, the spawned processes have to establish and also maintain a queue to this central shared service, which is not present at all in a simple [SEQ]-code-execution ( as it can cause zero source-of-randomness'-state update troubles there ), but which requires some additional hidden overheads ( MUTEX / LOCKS / value-update atomics et al ) in the case of [PAR]-code-execution units that actually "share" this central resource ( hidden to be strictly transactional, with a concurrency-capacity of 1, operating a 1:N-blocking-logic signalling, with no rollback ... ) and must execute and enforce an atomically safe update of the source-of-randomness' state before any next call to the source-of-randomness' service may get served. A per-worker RNG sketch that side-steps this kind of sharing follows the illustration below.
The other is the {prior|next}-loop-step dependency, the more so once mutually crossed among the pair of calls to the simple()-process instances. As illustrated below, this mutual inter-dependence actually forces all the potentially spawned calls to be arranged as a pure [SEQ]-process-schedule, and the "remaining" sum() is indeed not much meat to chew in a [PARALLEL]-process schedule.
ACTUAL MUTUAL INTER-PROCESS DEPENDENCY
ILLUSTRATION DEPICTS THE IMPOSSIBILITY TO AT LEAST INJECT A LOOP
SO AS TO MAKE THE [PARALLEL]-PROCESSING MORE EFFICIENT:
#  +--------------<---- a way to "inject" a for-loop
#  |  +-----------<---- NOT CONSUMED AT ALL
#  |  |
#  |  |       +--******* BUT ********* DEPENDENCY
#  |  |       |
#  |  |       v
#  |  |       +-------<-->- v[]-{prior|next}-step DEPENDENCY
#  |  |       |       +-<-->- c[]-{prior|next}-step DEPENDENCY
#  |  |       |       |
#  function simple( i, Q, vt_0, ct_0 )                       # the original signature
@everywhere function simple( N, Q, vt_0, ct_0, aVEC )
          for i = 1:N
              aVEC[i] = sum( ( rand(1:i), vt_0, ct_0 ) )*2
          #       a   = sum( ( rand(1:i), vt_0, ct_0 ) )*2   # the original step
          #       return a
          end
end
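Returning to the first suspect: a minimal sketch of how any centrally shared source-of-randomness' state can be side-stepped, by giving each worker its own, private RNG ( the naive myid()-based seeding shown is an illustrative assumption only, not a recipe for statistical quality ):

using Distributed                 # stdlib since Julia 0.7; part of Base before
addprocs( 2 )

@everywhere using Random
@everywhere const LOCAL_RNG = MersenneTwister( 1234 + myid() )
                                  # each process owns its RNG-state, so no
                                  # [PAR]-unit ever queues on a shared update
@everywhere simple_rand( i ) = rand( LOCAL_RNG, 1:i )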
While adding CSP-channels with explicit inter-process communication via the { put!() | take!() } Channel-methods may solve the need to communicate this dependency, guess what, the coroutines-scheduling will only add further overheads, so expect to pay even more to get less.
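For completeness, a minimal sketch of what such an explicit hand-over would look like ( the channel name, the Float64 payload and the one-slot capacity are all illustrative assumptions ) -- note how each step still has to block on take!() until the prior step has finished, i.e. a pure [SEQ]-schedule, now plus the channel overheads:

using Distributed
addprocs( 1 )

vt_chan = RemoteChannel( () -> Channel{Float64}( 1 ) )   # a 1-slot hand-over channel

@everywhere function simple_step( i, Q, chan )
    vt_prev = take!( chan )                  # BLOCKS until the prior step arrives
    vt_next = sum( ( rand( 1:i ), vt_prev, Q ) ) * 2
    put!( chan, vt_next )                    # hand the value over to the next step
    return vt_next
end

put!( vt_chan, 0.0 )                         # seed the very first step
v = [ remotecall_fetch( simple_step, 2, i, 1.0, vt_chan ) for i in 1:10 ]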
A minor note on raw-profiling: in all cases, it is advisable to place the tic() .. toc() bracket tightly around the code-section under test, and to exclude any and all memory-allocations and similarly long and noisy parts from the actual measured execution:
# ------------------------------------------------------------<SuT>-START
tic()
for i = 1:N
    v[i] = fetch( remotecall( simple, 2, i, 1, vt_0, ct_0 ) )
    c[i] = fetch( remotecall( simple, 3, i, 2, vt_0, ct_0 ) )
    vt_0 = v[i]
    ct_0 = c[i]
end
toc()
# ------------------------------------------------------------<SuT>-FINISH
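On Julia ≥ 1.0, where tic() / toc() are no longer available, the same tight bracketing can be expressed with time_ns(), as in this sketch:

t0 = time_ns()                       # ------------------------<SuT>-START
# ... the code-section under test ...
t1 = time_ns()                       # ------------------------<SuT>-FINISH
println( "elapsed [s]: ", ( t1 - t0 ) * 1e-9 )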