
I want to implement an online, recursive, parallel algorithm in Python.

Every time I get a new observation, I want to calculate a new coefficient matrix. Each row of this matrix must be calculated in parallel.

Is it too expensive to create a new process for each time-step, one that takes the row of the previous time-step as input, calculates the row for the next time-step, and is then killed and created again?

Or is it better to have the process running the whole time? If the latter is the better of the two, how can I resume the same process but with different inputs?

Is there any way?

Bekromoularo
  • The problem is clear and sound. **Yet, do not hesitate to be even more quantitative in your Question above** - add matrix-dimensions, add nanoseconds per time-slice node calculations, add the total number of time-slices in the forward-processing ( the time window to step-forwards ), include the source-data precision ( for the measurements acquired ) and the target output precision ( for results delivered ). Number of cores is nothing special, one can have 8192-CPU devices these days, yet the design is very sensitive to the points raised below ( NUMA-latency costs, overhead-avoidance ... ) – user3666197 Mar 05 '18 at 20:08

1 Answer


Is it too expensive to create for each time-step a new process...?

Yes, this is always expensive, and often very expensive. Persistent processes, which do not make you pay the constant overhead costs per each time-slice processed, are the more promising option here, but several additional factors have to be taken into account first.

All process-instantiation/termination overhead costs are the more expensive, the less mathematically dense/complex the task to calculate is. So if your per-step processing is cheap in the [TIME]-domain, the overhead costs will look all the more expensive, as your processing will have to pay them many times in a row.

All process instantiations will also pay remarkable overhead costs for memory (re-)allocations of data in the [SPACE]-domain. Having feasible semi-persistent data structures that persistent processes can work with, plus in-place matrix operators, may save you a lot via memory-allocation overhead avoidance (a very important topic on large-scale matrices, as in numerical-mathematics processing for FEM, ANN, video/image-kernel applications, etc.).
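One minimal sketch of such a semi-persistent, allocate-once structure, using the standard-library `shared_memory` segment (Python 3.8+) together with in-place numpy operators; the `ROWS`/`COLS` dimensions and the update itself are purely illustrative placeholders for your real matrix and recursion:

```python
import numpy as np
from multiprocessing import shared_memory   # stdlib, Python 3.8+

# Illustrative dimensions -- substitute your real matrix shape.
ROWS, COLS = 200, 1000

# Allocate the coefficient matrix ONCE, in a named shared segment that
# persistent worker processes could also attach to.
shm = shared_memory.SharedMemory(create=True, size=ROWS * COLS * 8)
coeffs = np.ndarray((ROWS, COLS), dtype=np.float64, buffer=shm.buf)
coeffs[:] = 0.0

# Per time-step updates then run in place -- no fresh allocation per step:
new_obs = np.full(COLS, 0.5)            # toy observation
coeffs += new_obs                       # broadcasts over rows, buffer re-used
np.multiply(coeffs, 0.99, out=coeffs)   # in-place scaling via out=

snapshot = coeffs.copy()                # detach a copy before releasing
shm.close()
shm.unlink()
```

Worker processes would attach via `shared_memory.SharedMemory(name=shm.name)` instead of receiving matrix copies through pipes, which is exactly the [SPACE]-domain saving described above.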


Do not rely on gut feelings alone.

Review all details of this logic in the re-formulated Amdahl's Law, so as to have all the quantitative figures before deciding on this design dilemma. Benchmark each of the processing stages: process instantiation, memory-transfer costs (in parameters), the processing costs of the computing phase "inside" the one-step-forward computation, and the costs of re-distributing results among all involved counterparties.

Only then will you be able to quantitatively determine a break-even point, after which more processes will not improve the processing (will stop lowering the overall duration and will start adding more overhead costs than the parallel-process-accelerated computing can cover).
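A back-of-the-envelope calculator for that break-even reasoning might look as follows. This is a simple overhead-aware variant of Amdahl's Law, not the full re-formulation linked below, and every number in the example run is illustrative rather than measured:

```python
def speedup(serial_frac, n, setup_overhead_s, per_step_work_s, steps):
    """Overhead-aware Amdahl-style speedup for a run of `steps` time-slices.

    serial_frac      -- fraction of per-step work that cannot parallelise
    n                -- number of worker processes
    setup_overhead_s -- one-off cost of spawning/joining the n workers
    per_step_work_s  -- wall time of one time-step on a single core
    steps            -- number of time-steps in the run
    """
    t1 = steps * per_step_work_s
    tn = (setup_overhead_s
          + steps * per_step_work_s * (serial_frac + (1 - serial_frac) / n))
    return t1 / tn

# Illustrative figures only: 5% serial part, 0.5 s total spawn cost,
# 10 ms of work per step, 10_000 steps.
for n in (2, 8, 32, 200):
    print(n, round(speedup(0.05, n, 0.5, 0.010, 10_000), 2))
# Note how the curve flattens far below a naive 200x at n = 200.
```

Replacing the illustrative constants with your own benchmarked durations turns this into a concrete decision tool for "how many processes are worth it".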


Is it better to have the process running for the whole time?

This may help a lot, once you avoid paying the repetitive costs of process instantiation and termination.

Yet there are costs for signalling and data re-propagation among all such computing-service processes. Next, all processes have to fit inside real RAM, so as not to lose to [SPACE]-motivated swap-out/swap-in tidal waves, which flow indeed very slowly and would kill the whole idea of [TIME]-motivated performance increases.


Do benchmark + vectorise + best, JIT/LLVM-compile the code. A must!

This is your strongest lever for performance increases, given that Python is your choice. If you are serious about performance, there is little more to say here: numpy + numba are just great for this. When shaving the last few [ns] off already performant code, narrow specialisations of calling interfaces and better vectorisation alignment (cache friendliness) are your tools.
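A minimal numba sketch of a JIT-compiled recursive time-step update. The decay-style recursion is only a placeholder for the real per-row calculation, and the `ImportError` fallback merely keeps the example runnable where numba is not installed:

```python
import numpy as np
try:
    from numba import njit            # third-party: pip install numba
except ImportError:                   # graceful fallback without numba
    njit = lambda f: f

@njit
def step_update(coeffs, obs, decay):
    """One recursive time-step, updating `coeffs` in place (illustrative)."""
    rows, cols = coeffs.shape
    for i in range(rows):
        for j in range(cols):
            coeffs[i, j] = decay * coeffs[i, j] + obs[j]
    return coeffs

coeffs = np.zeros((4, 8))
obs = np.ones(8)
for _ in range(3):                    # three recursive time-steps
    step_update(coeffs, obs, 0.5)
```

The first call pays a one-off compilation cost; every subsequent time-step then runs the LLVM-compiled machine code, which is exactly where a long-running online algorithm amortises JIT best.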

user3666197
  • Thank you for the quick reply. The calculation for every row is very expensive and I want to scale up to 200 cores for a very high-dimensional problem, so 200 parallel processes. My real question is: how can I restart the same process with different inputs, since my calculation is recursive? Is this possible? Or will I need to kill it and create a different process? – Bekromoularo Mar 05 '18 at 19:48
  • Excuse me for skipping the recursive design feature for the moment. Do not hesitate to be more quantitative in your Question above - add matrix-dimensions, add nanoseconds per time-slice node, add the total number of time-slices in the forward-processing ( the time window to step-forwards ), include the source-data precision ( for the measurements acquired ) and the target output precision ( for results delivered ). Number of cores is nothing special, one can have 8192-CPU devices these days, yet the design is very sensitive to the points raised above ( NUMA-latency costs, overhead-avoidance ... ) – user3666197 Mar 05 '18 at 20:06
  • I do not have these numbers now. But killing and creating 200 or even more processes every time-step is definitely expensive, is it not? On the other hand, if the killing and creating of each process is done in parallel on its own core, should it not be cheap? I am really new to this and may have trivial questions, but I am confused. – Bekromoularo Mar 05 '18 at 20:16
  • Getting these numbers is easy and ought to be **the first step to take**: you may re-use the Test-Case-XYZ **design-patterns discussed here** >>> https://stackoverflow.com/a/47594399 Measured duration figures are in [us]. Next use the re-formulated Amdahl's Law to see the realistic effects. – user3666197 Mar 05 '18 at 20:41