
Following is the requirement:

class MultiMachineDoWork:

    @staticmethod
    def Function1(A, B):
        return A + B

    @staticmethod
    def Function2(A, B):
        return A * B

    @staticmethod
    def Function3(A, B):
        return A ** B

    @staticmethod
    def Function4():
        # Collect the three independent results and combine them.
        X = MultiMachineDoWork.Function1(5, 10)
        Y = MultiMachineDoWork.Function2(5, 10)
        Z = MultiMachineDoWork.Function3(5, 10)
        return X + Y + Z

Assuming that Function1, Function2 & Function3 each take a very long time, it is better to run them in parallel on a distributed model, on machines L, M & N respectively. Function4 can then run on machine P, which collects and combines the results.

MapReduce works on a somewhat similar concept, but it runs the same function on different parts of the data. Can Dask / Ray / Celery be of any use in this case study?

If a custom solution has to be built, how should that solution proceed?

Pydoop/Spark with a Dask local cluster?


Real-life case study: an ensemble model for ML classification. One function for RandomForest, one for Support Vector and one for XGBoost, all running on the same dataset...

  • Surprising, really!! The question has a negative vote. I guess I am the only person left in the world who doesn't know the answer to such a trivial architectural question... – Rakesh Kumar Khandelwal May 20 '19 at 04:12
  • Seeing as you've mentioned Pydoop, there's been some degree of experimentation with NN training here: https://github.com/crs4/pydoop-examples/tree/master/examples/pydeep. That's currently inactive and might not work properly with the next Pydoop release, but it should provide some useful pointers. – simleo May 21 '19 at 09:44

1 Answer


Distributing a task/function/computation across multiple machines/nodes can be done with various frameworks in Python. The most common and widely used are Ray, Dask, and PySpark, and which one to use really depends on the use case.

For simple function/task distribution, you can use the Ray library (@ray.remote) to distribute the calls and then use the get method to collect and combine the results. The same can be done through Dask.
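
A minimal sketch of the question's three functions as Ray tasks (lower-cased names chosen for illustration; assumes Ray is installed, and ray.init() starts a local runtime unless you pass the address of an existing cluster):

    import ray

    ray.init()  # local runtime; pass address=... to join a running cluster

    @ray.remote
    def function1(a, b):
        return a + b

    @ray.remote
    def function2(a, b):
        return a * b

    @ray.remote
    def function3(a, b):
        return a ** b

    # Each .remote() call returns a future immediately, so the three tasks
    # can run in parallel on whichever nodes Ray schedules them to.
    x = function1.remote(5, 10)
    y = function2.remote(5, 10)
    z = function3.remote(5, 10)

    # Function4's role: block until all three results are ready, then combine.
    print(sum(ray.get([x, y, z])))  # 15 + 50 + 9765625 = 9765690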

https://rise.cs.berkeley.edu/blog/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray/
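
The same shape with Dask, as a sketch using dask.delayed (Client() with no arguments starts a local cluster; point it at a scheduler address for a real one):

    import dask
    from dask.distributed import Client

    client = Client()  # local cluster by default

    # Wrap the three long-running functions; the wrapped calls are lazy.
    f1 = dask.delayed(lambda a, b: a + b)
    f2 = dask.delayed(lambda a, b: a * b)
    f3 = dask.delayed(lambda a, b: a ** b)

    x, y, z = f1(5, 10), f2(5, 10), f3(5, 10)
    total = dask.delayed(sum)([x, y, z])

    print(total.compute())  # the three leaf tasks run in parallel on the workers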

I would prefer Spark/PySpark when you are dealing with a large dataset and want to perform some kind of ETL operation, distributing the huge dataset across multiple nodes and then performing some transformation or operation on it. Note that the Spark/MapReduce concept assumes you bring the computation to the data: it executes the same/similar task on different subsets of the data and finally performs some aggregation (which involves shuffling).
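
To make the "same task on different subsets of data" point concrete, a minimal PySpark illustration (summing the squares of 1..1000; the numbers here stand in for a large dataset):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("map-reduce-demo").getOrCreate()

    # Split the data into 8 partitions; the same map function runs on each
    # partition across the executors, and reduce aggregates the partial results.
    rdd = spark.sparkContext.parallelize(range(1, 1001), numSlices=8)
    total = rdd.map(lambda n: n * n).reduce(lambda a, b: a + b)
    print(total)

    spark.stop()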

Spark/PySpark supports ensembles through its built-in random forest and gradient-boosted tree algorithms. But training separate models (random forest, gradient trees, logistic regression, etc.) on separate nodes/executors is currently not supported in Spark out of the box. It might be possible through customized Spark code, similar to what Spark does internally for random forests (training multiple decision trees).

The real-life ensembling scenario can easily be handled with Dask and scikit-learn. Dask integrates well with scikit-learn, XGBoost, etc. to perform parallel computation across distributed cluster nodes/workers using the joblib context manager.

For the ensemble scenario, you can take different scikit-learn models/algorithms (RandomForest, SGD, SVM, logistic regression) and use the VotingClassifier to combine the multiple models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone (the basic idea behind ensembling).

With Dask, the individual sub-estimators/models are trained on different machines in the cluster.

https://docs.dask.org/en/latest/use-cases.html

At a high level, the code will look like this:

import joblib
from dask.distributed import Client
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

client = Client()  # connect to the Dask cluster (local by default)

classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('xgboost', XGBClassifier()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers)

# Route joblib's parallelism to the Dask workers, so each sub-estimator
# is fitted on a different worker. X, y are the training data.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

The above can also be achieved with other distributed frameworks like Ray or Spark, but it will need more customized coding.
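
For instance, Ray ships its own joblib backend (a sketch, assuming a recent Ray version and reusing clf, X, y from the Dask example above):

    import joblib
    from ray.util.joblib import register_ray

    register_ray()  # registers "ray" as a joblib backend

    # The sub-estimators are now fitted on Ray workers instead of Dask workers.
    with joblib.parallel_backend("ray"):
        clf.fit(X, y)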

Hope this information helps you!

Kuntal-G
  • Thanks Kuntal for validating my thought process!! – Rakesh Kumar Khandelwal May 20 '19 at 04:00
  • With the Dask example above, the only problem is that it's difficult to isolate what runs where... we just call Dask and then more or less assume that the parallelism works the way the case scenario demands. I am optimistic that some library would give us absolute control over the granularity of execution... – Rakesh Kumar Khandelwal May 20 '19 at 04:10