In my Python-based Spark job 'main.py', I reference a protobuf-generated module 'a_pb2.py'. If I place all the files in the root directory, like this:
/
- main.py
- a_pb2.py
and zip a_pb2.py into 'proto.zip', then run
spark-submit --py-files=proto.zip main.py
everything runs as expected.
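For context, main.py is roughly shaped like the sketch below (simplified: the Account field, the input path, and the transformation are illustrative, and the real job does more):

# main.py -- simplified sketch; 'name' is an illustrative field and the
# real transformation is more involved
from pyspark import SparkContext

import a_pb2  # generated by protoc, shipped to the workers via --py-files


def to_account(line):
    acc = a_pb2.Account()
    acc.name = line  # illustrative field
    return acc


if __name__ == "__main__":
    sc = SparkContext(appName="proto-job")
    accounts = sc.textFile("accounts.txt").map(to_account).collect()
    print(len(accounts))
    sc.stop()

As far as I can tell, the collect() is what makes the workers pickle the Account objects to send them back, which is where the serializer in the traceback further down comes in.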
However, if I move the protobuf module into a package, organizing my files like this:
/
- main.py
- /protofiles
  - __init__.py
  - a_pb2.py
and zip /protofiles into 'proto.zip', then run
spark-submit --py-files=proto.zip main.py
the Spark job fails, saying it cannot pickle the Account class inside a_pb2:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 272, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in dumps
    return pickle.dumps(obj, protocol)
_pickle.PicklingError: Can't pickle class 'a_pb2.Account': import of module 'a_pb2' failed
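In the packaged layout the only change I make to main.py is the import; the rest of the sketch above stays the same (and this import is exactly the part I may be getting wrong):

# main.py, packaged layout -- only the import changes relative to the sketch above
from protofiles import a_pb2  # instead of: import a_pb2

# to_account() and the rest of the driver code are unchanged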
I am assuming this class serialization happens in order to distribute the dependencies to the worker nodes. But wouldn't the same serialization also happen in the flat-layout case, and if the class were unpicklable, shouldn't it have failed there as well? Needless to say, I am confused and not familiar with the nuances of how Spark treats dependencies shipped as packages versus plain modules.
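In case the way I build the archives matters, this is roughly what I do. I actually use the zip CLI; the zipfile version below is just to make the intended archive layout explicit, and the package-case layout is the part I am least sure about:

import zipfile

# Flat layout: a_pb2.py sits at the root of proto.zip
with zipfile.ZipFile("proto.zip", "w") as zf:
    zf.write("a_pb2.py")

# Package layout: protofiles/ is (I believe) the top-level entry in proto.zip
with zipfile.ZipFile("proto.zip", "w") as zf:
    zf.write("protofiles/__init__.py")
    zf.write("protofiles/a_pb2.py")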
Any suggestions appreciated!