I am struggling to ship third-party Python packages to Spark executors. I have referred to a lot of other posts/questions but can't get this working. What I have tried so far:
Spark: 2.3, Mode: YARN/client
- Created a directory named dependencies
- Copied all my packages from site-packages into dependencies
- Zipped at the first level: cd dependencies && zip -r ../dependencies.zip .
- Did spark-submit with --py-files dependencies.zip
- Added dependencies.zip to the search path with sys.path.insert(0, "dependencies.zip") and also called sc.addPyFile('dependencies.zip') in the code
- verified that it is getting added by printing sys.path
- But when I try to import the package, I get an import error:

ImportError: No module named 'google.cloud.pubsub_v1'
I have referred to other posts such as "I can't seem to get --py-files on Spark to work".
They all seem to suggest the same approach, but somehow I can't get it working.
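As a sanity check on the mechanism itself (outside of Spark), importing a pure-Python package from a zip on sys.path does work for me locally. A minimal sketch, using a throwaway package name (mypkg is hypothetical, the zip path is a temp file):

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip mimicking the dependencies.zip layout:
# package directories at the top level of the archive.
tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "dependencies.zip")
with zipfile.ZipFile(zpath, "w") as z:
    z.writestr("mypkg/__init__.py", "")
    z.writestr("mypkg/mod.py", "VALUE = 42\n")

# Same mechanism as sys.path.insert(0, 'dependencies.zip'),
# but with an absolute path so it resolves from any working directory.
sys.path.insert(0, zpath)
from mypkg import mod
print(mod.VALUE)  # 42
```

Since this works locally, I suspect the problem is not the zip layout itself but something on the executor side (for example, I understand zipimport cannot load packages containing compiled extension modules, and google-cloud-pubsub depends on grpcio, which has C extensions).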