In Spark, how do I use groupBy with spark-submit?

Question

I have a spark python script that has a groupBy in it. In particular, the structure is

import operator
result = sc.textFile(...).map(...).groupBy(...).map(...).reduce(operator.add)

When I run this in an ipython pyspark shell, it works just fine. However, when I try to script it and run it through spark-submit, I get a pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'> error citing the groupBy as the concern. Is there a known workaround for this?

score 0 · Answer 1 · answered Nov 04 '14 at 01:30

0

It turns out there's a lot that pickle can't do, including lambdas. I was doing some of that and needed to be more careful.

answered Nov 04 '14 at 01:30

user592419

5,103
9
42
67

Spark uses its own fork of `cloudpickle` for extending Pickle to support additional types, including lambdas. If you can come up with a small, standalone example of a Spark program that fails with this pickling error, could you open an issue at https://issues.apache.org/jira/browse/SPARK so we can fix it? Thanks! – Josh Rosen Nov 04 '14 at 03:26
Hey Josh. I ended up changing the structure of the program but am running into a consistent error that I wrote up here: http://stackoverflow.com/questions/26726780/what-is-the-difference-between-spark-submit-and-pyspark. Do you mind taking a look at that please? – user592419 Nov 05 '14 at 17:52

In Spark, how do I use groupBy with spark-submit?

1 Answers1