I'm exploring PySpark and the possibilities of integrating scikit-learn with PySpark. I'd like to train a model on each partition using scikit-learn. That means, when my RDD is defined and gets distributed among different worker nodes, I'd like to use scikit-learn to train a model (let's say a simple k-means) on each partition that lives on each worker node. Since scikit-learn algorithms take a Pandas DataFrame, my initial idea was to call toPandas for each partition and then train my model. However, the toPandas function collects the DataFrame into the driver, and that is not what I'm looking for. Is there any other way to achieve such a goal?
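To make the idea concrete, this is roughly what I have in mind (just a rough sketch I haven't gotten to work; it assumes an RDD of numeric feature vectors, and the names are only illustrative):

import numpy as np
from sklearn.cluster import KMeans

def train_partition(rows):
    # Materialize this partition's rows locally on the worker
    data = np.array(list(rows))
    if len(data) < 3:  # skip partitions too small for 3 clusters
        return
    model = KMeans(n_clusters=3).fit(data)
    yield model.cluster_centers_  # ship only the centroids back to the driver

# rdd is assumed to be an RDD of numeric feature vectors
per_partition_centroids = rdd.mapPartitions(train_partition).collect()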

- If I can somehow convert each partition into a DataFrame or an array-like structure, that should be possible, right? – HHH Jul 04 '16 at 15:13
- I don't see how it is relevant to compute a model on each partition. What does that even mean? In practice, how would you assemble the models? – eliasah Jul 04 '16 at 15:15
- So let's say I run a k-means on each partition; then somehow I should transfer all the centroid points to the driver. This would be like an approximate k-means. However, I don't know how to transfer the centroid points to the driver. Any idea? – HHH Jul 04 '16 at 15:24
- It still doesn't make sense. Those centroids were trained with respect to a certain vector space. You can't just take them and average them or the like. – eliasah Jul 04 '16 at 15:26
- And if you want to use k-means, why don't you use Spark's implementation directly? – eliasah Jul 04 '16 at 15:26
- I agree that this won't be the best solution, but the main reason to do such a thing is to see how I can integrate scikit-learn with PySpark. So I'm not looking for the best clustering. Having said that, do you think we can somehow use k-means (or any other clustering or classification) in that way? That is, how can I convert each partition to an array-like structure? – HHH Jul 04 '16 at 15:31
- This question is getting quite broad. First, you can't integrate scikit-learn with Spark in that way. Second, no, you still can't use clustering methods in that way; it doesn't make any sense. I won't answer the third one, because an array-like structure could be an RDD or anything else, so that doesn't make sense either. – eliasah Jul 04 '16 at 15:37
- http://stackoverflow.com/q/34438829/1560062 - but @eliasah is right here. Having weak models alone won't get you anywhere, especially when you're interested in unsupervised learning. – zero323 Jul 04 '16 at 16:15
3 Answers
scikit-learn can't be fully integrated with Spark for now, and the reason is that scikit-learn algorithms aren't implemented to be distributed: they work only on a single machine.
Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which (for the moment) supports executing a grid search on Spark for cross-validation.
Edit
As of 2020, spark-sklearn is deprecated and joblib-spark is its recommended successor. Based on the documentation, you can easily distribute a cross-validation to a Spark cluster like this:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # folds run as Spark tasks

print(scores)
A GridSearchCV can be distributed in the same way.
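For instance, a sketch of the same pattern with GridSearchCV (the parameter grid here is just for illustration, and I haven't run this on a real cluster):

from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)

with parallel_backend('spark', n_jobs=3):
    search.fit(iris.data, iris.target)  # candidate fits are dispatched as Spark tasks

print(search.best_params_)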

- What if I want to fit an arbitrary model, let's say run a k-means on each partition? Is that supported in spark-sklearn? – HHH Jul 04 '16 at 15:15
- Last time I checked, this library supported only distributed grid search, and that still seems to be the case. Unfortunately, distributing algorithms isn't just plug and play, or it would have been very easy. Unless sklearn implements those algorithms directly on Spark, such an easy integration isn't possible. – eliasah Jul 04 '16 at 15:22
- What about in a notebook? Let's say we want to integrate sklearn and PySpark on Colab. Is that possible? – Memphis Meng Jan 17 '22 at 22:45
No, scikit-learn doesn't work with PySpark, the reason being that scikit-learn is a package that works on an individual computer, whereas Spark is a distributed environment.

def pandas_filter_func(iterator):
    # Each element of `iterator` is a pandas DataFrame holding a chunk of the partition
    for pandas_df in iterator:
        yield pandas_df[pandas_df.a == 1]

df.mapInPandas(pandas_filter_func, schema=df.schema).show()
Took it from here:
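In the same spirit, you could fit a scikit-learn model on each partition inside the function, something like this (an untested sketch; it assumes df has two numeric columns named x and y and that each partition has enough rows for the chosen number of clusters):

import pandas as pd
from sklearn.cluster import KMeans

def kmeans_per_partition(iterator):
    # mapInPandas may hand over several pandas batches per partition
    batches = list(iterator)
    if not batches:
        return
    pdf = pd.concat(batches)
    model = KMeans(n_clusters=3).fit(pdf[["x", "y"]])
    # Return this partition's centroids as a small DataFrame
    yield pd.DataFrame(model.cluster_centers_, columns=["x", "y"])

df.mapInPandas(kmeans_per_partition, schema="x double, y double").show()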
