0

I have a spark python script that has a groupBy in it. In particular, the structure is

import operator
result = sc.textFile(...).map(...).groupBy(...).map(...).reduce(operator.add)

When I run this in an ipython pyspark shell, it works just fine. However, when I try to script it and run it through spark-submit, I get a pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'> error citing the groupBy as the concern. Is there a known workaround for this?

user592419
  • 5,103
  • 9
  • 42
  • 67

1 Answers1

0

It turns out there's a lot that pickle can't do, including lambdas. I was doing some of that and needed to be more careful.

user592419
  • 5,103
  • 9
  • 42
  • 67
  • Spark uses its own fork of `cloudpickle` for extending Pickle to support additional types, including lambdas. If you can come up with a small, standalone example of a Spark program that fails with this pickling error, could you open an issue at https://issues.apache.org/jira/browse/SPARK so we can fix it? Thanks! – Josh Rosen Nov 04 '14 at 03:26
  • Hey Josh. I ended up changing the structure of the program but am running into a consistent error that I wrote up here: http://stackoverflow.com/questions/26726780/what-is-the-difference-between-spark-submit-and-pyspark. Do you mind taking a look at that please? – user592419 Nov 05 '14 at 17:52