
I have a function that accepts an array and returns a number.

I want to run this function on 10 different input arrays, and then return the sum of all results.
What is the best way to tell Python to run these 10 calculations in parallel on a computer with 4 cores?
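
Currently I do this serially, roughly like this (a simplified sketch; `calc` is just a stand-in for my real function):

```python
import numpy as np

def calc(arr):
    # stand-in for my real function: takes an array, returns a number
    return float(arr.mean())

inputs = [np.random.rand(1000) for _ in range(10)]   # 10 different input arrays

total = sum(calc(arr) for arr in inputs)             # the sum of all 10 results
print(total)
```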

The documentation describes many different ways to achieve concurrent execution, and there are also many packages for parallel processing. With so many options, which of these should I use for the task above?

EDIT:
the function is a Machine-Learning function:
it partitions the input array into two parts - a "train" part and a "test" part of the dataset. It trains a certain classifier on the "train" part, and returns the prediction accuracy of that trained classifier on the "test" part.
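
Schematically, it does something like this (a simplified sketch; the actual classifier and the real data are different):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def calc(arr):
    # simplification: assume a 2-D array whose last column holds the labels
    X, y = arr[:, :-1], arr[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = DecisionTreeClassifier()        # "a certain classifier"
    clf.fit(X_train, y_train)             # train on the "train" part
    return clf.score(X_test, y_test)      # prediction accuracy on the "test" part
```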

  • What's wrong with `multiprocessing`? It's even in the stdlib. – Ignacio Vazquez-Abrams Oct 25 '17 at 11:41
  • @IgnacioVazquez-Abrams that was indeed the first library that I found, but then I saw so many other options, like threading, concurrent and subprocess in the stdlib, as well as many third-party options that I became confused... what is the best tool for the task? – Erel Segal-Halevi Oct 25 '17 at 13:43
  • Depending on the function (not given...) you might vectorize it within numpy first (can be much more powerful compared to 4 cores multiprocessing). And yeah, the best tool for the task surely depends on the function! – sascha Oct 25 '17 at 13:50
  • @sascha I added an explanation about what my function does. What do you mean by "vectorize it within numpy"? – Erel Segal-Halevi Oct 25 '17 at 14:25
  • That does not matter anymore given your still imprecise function description. (basic idea was: loopless numpy-usage which results in very fast operations using SIMD and sometimes Multithreading). – sascha Oct 25 '17 at 14:26
  • @IgnacioVazquez-Abrams ***What's wrong with* `multiprocessing`?** Well, sir, with all due respect, it does not work in this question on Machine Learning pipelines' context. **Why?** First, because it has been already used inside the `scikit` tools ( and heavily harnessed if `n_jobs = -1`, so if one would try to add additional processes, the resulting performance will drop significantly -- if from nothing else, then from introduced cache-line devastations .. LRU states are not reinstated on context switch ). **Next** `multiprocessing` replicates all python state / DataSETs and will not fit RAM – user3666197 Oct 25 '17 at 17:32
  • @ErelSegal-Halevi Greetings, Erel. What do you think about usefulness of the answer and help provided in discussion and comments so far? StackOverflow encourages users to express pieces of usefulness, interesting examples or pure inspiration by UpVoting. What is your feedback to the Community, that helped you with your problem? – user3666197 Nov 08 '17 at 15:26
  • @user3666197 I am still not sure... which method should I use? Threading? Multiprocessing? Concurrent? Something else? – Erel Segal-Halevi Nov 09 '17 at 12:25

1 Answer


Best use the one that best "Fairly Divides a Cake with Burnt Parts ... "

As your own domain expertise will tell you, the art of performance-motivated scheduling is similarly constrained.

Given the few facts above,
the approach to a solution is also heavily scale-dependent.

Given that Machine Learning has been added,
the problem is not just a common, trivial [SEQ]-[PAR]-[SEQ] pipeline that processes 10 different arrays and sums the partial results.

The bigger the ML-DataSET gets,
the larger the "Burnt Parts of the Cake" will be.

Being inside a realm of numpy / python and their tools for Machine Learning, your driving force will not be a will to "parallelise", but rather the costs of trying to do so. Machine Learning pipelines for real-world problems may easily take from a few hours to small tens of hours to get trained, using all, yes ALL, localhost-available CPU-cores, just for one train-part of the DataSET. The efficiency professionally built into the numpy / numba / scikit tools is not easy to increase much further, so rather do not expect any low-hanging fruit anywhere near these grounds. For more technical details on this, you might like to read this post.
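
A minimal sketch of this very point ( assuming a scikit-learn `RandomForestClassifier` with `n_jobs = -1`; the OP's actual classifier is not known ) -- the estimator already fans out over all the localhost CPU-cores, so wrapping it into an outer pool of processes just asks for oversubscription of the same 4 cores and replicates the DataSET in RAM:

```python
import numpy as np
from multiprocessing import Pool
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def evaluate(arr):
    X, y = arr[:, :-1], arr[:, -1].astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=100,
                                 n_jobs=-1)   # ALREADY uses all local cores
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

if __name__ == '__main__':
    arrays = [np.c_[np.random.rand(500, 5),
                    np.random.randint(0, 2, 500)] for _ in range(10)]

    # an outer Pool( 4 ) on top of n_jobs = -1 asks for 4 x ( all cores ) workers
    # on the very same 4 physical cores, plus a replicated DataSET per process
    with Pool(4) as pool:
        total = sum(pool.map(evaluate, arrays))
    print(total)
```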

Memory matters the most.

For small DataSETs, the [PSPACE]-costs in terms of memory footprint will remain easy to bear, but that also means it will be harder to pay off all the [PTIME]-costs that have to be paid per each parallel process -- yes, the non-productive overhead fee one always has to pay just for entering the show: { instantiation + setup + control + finalisation + dismantling } -- which in this very case ( of small DataSETs ) will hardly be justified by any increased processing performance potentially coming out of an attempt to operate parallel processing on too small-scale a problem.
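
A quick sanity check is to simply measure both variants ( a sketch; the `work` function below is a trivial placeholder, so on such a small-scale job the process-pool will typically lose, exactly due to the overhead fees named above ):

```python
import time
import numpy as np
from multiprocessing import Pool

def work(arr):
    # trivial placeholder for the per-array job; a real ML-training run
    # would be much heavier and would shift the balance
    return float(np.sort(arr).mean())

if __name__ == '__main__':
    arrays = [np.random.rand(10000) for _ in range(10)]

    t0 = time.perf_counter()
    serial = sum(work(a) for a in arrays)
    t1 = time.perf_counter()

    # pays { instantiation + setup + control + finalisation + dismantling }
    with Pool(4) as pool:
        parallel = sum(pool.map(work, arrays))
    t2 = time.perf_counter()

    print(f"serial  : {t1 - t0:.4f} [s]")
    print(f"Pool(4) : {t2 - t1:.4f} [s]  ( incl. all the process-pool overheads )")
```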

Available Processing Resources matter almost as much

In a case where the processing is ordered to be operated in 10 process instances concurrently, the governing factor is the free capacity of all the resource classes that are relevant for such a process's smooth execution -- a free ( and best uninterrupted ) CPU-core ( HT-off ), with non-blocked access to a sufficient amount of ( free / private >> working set ) RAM memory ( no memory paging, no virtual-memory swaps ), plus minimum ( or fine-tuned ) disk-IO.
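
One practical consequence ( a sketch, assuming the stdlib `concurrent.futures` is acceptable and the per-array job itself stays single-core ): never ask for more worker processes than there are free physical CPU-cores -- the remaining jobs will simply have to wait in a queue:

```python
import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def evaluate(arr):
    # placeholder for the real train-and-score job
    return float(arr.mean())

if __name__ == '__main__':
    arrays = [np.random.rand(1000) for _ in range(10)]

    n_workers = min(4, os.cpu_count() or 1)      # do not oversubscribe the 4 cores
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # at most n_workers jobs ever run at once; the remaining arrays queue up,
        # i.e. the 10 jobs are "just"-[CONCURRENT], not all truly-[PARALLEL]
        total = sum(executor.map(evaluate, arrays))
    print(total)
```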

In cases where such resources' capacities do not allow all the named 10 instances of the Machine Learning process to indeed execute smoothly in parallel, another, "just"-[CONCURRENT], scheduling takes place, and one may straight away forget any and all of the Amdahl's Law theoretical promises ( even those from the original, "classical", overhead-naive formulation ).
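
A back-of-the-envelope check of what those promises shrink into, once the add-on overheads are accounted for ( a sketch of an overhead-aware, Amdahl-style estimate; the numbers below are placeholders, not measurements ):

```python
def speedup(p, n_workers, overhead):
    """Overhead-aware, Amdahl-style speedup estimate.

    p ........... parallelisable fraction of the serial run-time
    n_workers ... number of truly parallel workers ( here at most 4 )
    overhead .... add-on costs ( spawn + data-copies + IPC + teardown ),
                  expressed as a fraction of the serial run-time
    """
    return 1.0 / ((1.0 - p) + p / n_workers + overhead)

print(speedup(p=0.95, n_workers=4, overhead=0.00))   # ~3.5 x  ( the naive promise )
print(speedup(p=0.95, n_workers=4, overhead=0.25))   # ~1.9 x  ( once overheads get paid )
```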

While it is obvious to anyone that if a bus full of tourists arrives at a five-star Marriott Hotel equipped with just a pair of lifts off the reception floor, it is simply impossible in principle for all the new guests to reach their rooms at the same time -- the very same problem is far less crystal-clear in an almost identical setup in true-[PARALLEL] process-scheduling, and many users start in the very same way, asking "Why is my parallel code slower than the serial run?" or "Why does my parallel-code speedup not scale?".

This said, you will have to maximise your hardware resource pools and very carefully balance the parallel-processing ( setup + termination ) overhead costs, because without at least a chance to pay less than you will potentially gain ( i.e. to justify the overheads ), going parallel AT ANY COST is simply a devastating policy, whoever may have advised you to head into this a-priori lost ( costs/benefits-wise ) game.
