
I am struggling to ship third-party Python packages to Spark executors. I have referred to a lot of other posts/questions but can't get this working. What I have tried so far:

Spark: 2.3, Mode: YARN client

  1. Created a directory called dependencies
  2. Moved all my packages from site-packages into dependencies
  3. Zipped it at the first level:

cd dependencies
zip -r ../dependencies.zip .

  4. Did spark-submit with --py-files dependencies.zip
  5. Added dependencies.zip using sys.path.insert(0, "dependencies.zip") and also sc.addPyFile('dependencies.zip') in the code (see the sketch after this list)
  6. Verified that it is getting added by printing sys.path
  7. But when I try to import the package I get an import error: ImportError: No module named 'google.cloud.pubsub_v1'
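
For reference, a minimal sketch of what the driver code boils down to (the app name is just a placeholder; the submit command is the one in step 4):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="pubsub-job")   # placeholder app name
sc.addPyFile("dependencies.zip")          # ship the zip to every executor
sys.path.insert(0, "dependencies.zip")    # make the zip importable on the driver

print(sys.path)                           # dependencies.zip shows up at the top

from google.cloud import pubsub_v1       # fails: No module named 'google.cloud.pubsub_v1'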

I have referred to other posts like "I can't seem to get --py-files on Spark to work".

They all seem to suggest the same steps, but somehow I can't get this working.

Tony
some_user
  • There is not a simple answer for this. For me, what worked was creating a conda environment and shipping it to all executors. I followed the instructions in this Cloudera community guide - https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551 Try this, good luck :-) – Raghu Jul 18 '20 at 21:00
  • Thanks Raghu. I will try that. I saw it before but thought it's a little more tedious to ship the entire environment when things might work with just site-packages, and we have other complexities like environments being controlled by admins. But I think this will be my last resort before asking for dependencies to be locally installed on all nodes. – some_user Jul 19 '20 at 15:16
  • Been there, done that :-) It is a one-time setup, but once I was done, things went smoothly. – Raghu Jul 19 '20 at 15:37
  • Thanks Raghu.. This worked :-) – some_user Jul 27 '20 at 17:02

2 Answers


Create an empty __init__.py in the zip at the root level. Then, while importing the module, try importing it with the zip file's name as the prefix, i.e. import dependencies.google.cloud.
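
For example, rebuilding the archive roughly like this (a sketch based on the directory layout from the question):

cd dependencies
touch __init__.py               # empty package marker at the root of the archive
zip -r ../dependencies.zip .    # rebuild the zip with the marker included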

Shubham Jain
  • Tried this already. Making an explicit path call to dependencies.google.cloud works at the first level, but it starts giving import errors for modules inside google.cloud that refer back to google.cloud. It is somehow not considering dependencies.zip unless it is specified explicitly, even though dependencies.zip comes at the top when I print sys.path. – some_user Jul 19 '20 at 15:18

I followed the steps mentioned in https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551

Rather than conda, I just used a python3 virtual env. It worked fine.

PS: The example in the article works for cluster mode. For YARN client mode, you need to point the driver's Python (the first pyspark-python setting) at a local interpreter rather than at the #-aliased archive path. A rough sketch of the virtualenv variant is below.
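
Here is the rough shape of what I ran; venv-pack, the environment name, and my_job.py are my own placeholders rather than anything from the article:

# build and pack a relocatable virtualenv
python3 -m venv pyspark_env
source pyspark_env/bin/activate
pip install google-cloud-pubsub venv-pack
venv-pack -o pyspark_env.tar.gz

# YARN client mode: driver uses the local venv, executors use the shipped copy
export PYSPARK_DRIVER_PYTHON="$(which python)"
export PYSPARK_PYTHON=./environment/bin/python

spark-submit \
  --master yarn \
  --deploy-mode client \
  --archives pyspark_env.tar.gz#environment \
  my_job.py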

some_user