In my Python-based Spark job 'main.py', I reference a protobuf-generated module 'a_pb2.py'. If I place all the files in the root directory, like this:
/
- main.py
- a_pb2.py
and zip a_pb2.py into 'proto.zip', then run
spark-submit --py-files=proto.zip main.py
everything runs as expected.
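For context, main.py is roughly shaped like the sketch below (simplified: the Account field, the input path, and the transformation are illustrative, and the real job does more):

# main.py -- simplified sketch; 'name' is an illustrative field and the
# real transformation is more involved
from pyspark import SparkContext

import a_pb2  # generated by protoc, shipped to the workers via --py-files


def to_account(line):
    acc = a_pb2.Account()
    acc.name = line  # illustrative field
    return acc


if __name__ == "__main__":
    sc = SparkContext(appName="proto-job")
    accounts = sc.textFile("accounts.txt").map(to_account).collect()
    print(len(accounts))
    sc.stop()

As far as I can tell, the collect() is what makes the workers pickle the Account objects to send them back, which is where the serializer in the traceback further down comes in.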
However, if I move the protobuf module into a package, organizing my files like this:
/
- main.py
- /protofiles
  - __init__.py
  - a_pb2.py
and zip /protofiles into 'proto.zip', then run
spark-submit --py-files=proto.zip main.py
the Spark job fails, saying it cannot pickle the Account class inside a_pb2:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 272, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in dumps
    return pickle.dumps(obj, protocol)
_pickle.PicklingError: Can't pickle class 'a_pb2.Account': import of module 'a_pb2' failed
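In the packaged layout the only change I make to main.py is the import; the rest of the sketch above stays the same (and this import is exactly the part I may be getting wrong):

# main.py, packaged layout -- only the import changes relative to the sketch above
from protofiles import a_pb2  # instead of: import a_pb2

# to_account() and the rest of the driver code are unchanged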
I am assuming this class serialization happens in order to distribute the dependencies to the worker nodes. But wouldn't the same serialization also happen in the flat-layout case, and if the class were unpicklable, shouldn't it have failed there as well? Needless to say, I am confused and not familiar with the nuances of how Spark treats dependencies shipped as packages versus plain modules.
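In case the way I build the archives matters, this is roughly what I do. I actually use the zip CLI; the zipfile version below is just to make the intended archive layout explicit, and the package-case layout is the part I am least sure about:

import zipfile

# Flat layout: a_pb2.py sits at the root of proto.zip
with zipfile.ZipFile("proto.zip", "w") as zf:
    zf.write("a_pb2.py")

# Package layout: protofiles/ is (I believe) the top-level entry in proto.zip
with zipfile.ZipFile("proto.zip", "w") as zf:
    zf.write("protofiles/__init__.py")
    zf.write("protofiles/a_pb2.py")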
Any suggestions appreciated!