I have a dataset and want to test different classifiers in parallel using Spark with Python. For example, if I want to test a Decision Tree and a Random Forest, how can I run them in parallel?
I have tried a few approaches, but I keep getting this error:
cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This is what I tried (the same pattern worked well with scikit-learn's classifiers instead of Spark's):
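from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def apply_classifier(clf, train_dataset, test_dataset):
    # Fit the estimator, score the test set, and evaluate the predictions.
    model = clf.fit(train_dataset)
    predictions = model.transform(test_dataset)
    evaluator = BinaryClassificationEvaluator()
    evaluator.evaluate(predictions)
    return [(model, predictions)]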
...
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
classifiers = [dt, rf]
sc.parallelize(classifiers).flatMap(lambda x: apply_classifier(x, train_dataset, test_dataset)).collect()
Any suggestions on how I can manage to do this?
Thanks!
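
One possible direction, offered only as a sketch: Spark ML estimators and DataFrames are driver-side objects, so shipping them to executors through `sc.parallelize(...).flatMap(...)` is likely what triggers the pickling failure. An alternative is to keep every call on the driver and submit the fits from separate threads, so that Spark can schedule the resulting jobs concurrently. The helper names `fit_and_score` and `run_in_threads` below are hypothetical, the evaluator is passed in rather than constructed inside, and whether the jobs actually overlap depends on the cluster resources and scheduler configuration.

```python
from multiprocessing.pool import ThreadPool


def fit_and_score(clf, train_dataset, test_dataset, evaluator):
    # Runs in a driver thread; fit/transform/evaluate each launch Spark jobs.
    model = clf.fit(train_dataset)
    predictions = model.transform(test_dataset)
    score = evaluator.evaluate(predictions)
    return model, predictions, score


def run_in_threads(classifiers, train_dataset, test_dataset, evaluator):
    # One driver-side thread per classifier. Threads share memory, so
    # nothing is pickled, and results come back in input order.
    pool = ThreadPool(len(classifiers))
    try:
        return pool.map(
            lambda clf: fit_and_score(clf, train_dataset, test_dataset, evaluator),
            classifiers,
        )
    finally:
        pool.close()
        pool.join()
```

With the objects from the question, this would be called as `run_in_threads([dt, rf], train_dataset, test_dataset, BinaryClassificationEvaluator())`, assuming the evaluator's default columns match the data.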