I am trying to pip install the psycopg2 package on a Dataproc cluster. I tried the following, but it does not work because my work computer has firewall restrictions.

REGION=<region>
gcloud dataproc clusters create my-cluster \
  --image-version 1.4 \
  --metadata 'CONDA_PACKAGES=psycopg2' \
  --metadata 'PIP_PACKAGES=psycopg2' \
  --initialization-actions \
  gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

So I have now placed the psycopg2.whl and psycopg2.tar.gz files in GCS. I need to install them during Dataproc cluster creation, which seems possible according to https://stackoverflow.com/a/50280108/13433956. Can anyone provide more details on how to get pip to install the whl or tar.gz file from GCS through Dataproc initialization actions? Thanks!

Igor Dvorzhak
Bajwa

1 Answer

I think you can do this by customizing your initialization action:

  1. Download the wheel package from GCS to the local file system.
  2. Run pip install [local wheel package] from there.
  3. Upload the customized initialization action file to a GCS path and create the cluster with it.

Please follow the best practice when doing so.
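The steps above can be sketched as a small script. This is only an illustration, not a tested initialization action: the file name install.sh, the metadata key PACKAGE_URI, and the bucket placeholders are all hypothetical, and it assumes that gsutil, pip, and the /usr/share/google/get_metadata_value helper are available on the cluster image.

```shell
#!/bin/bash
# Sketch of a custom initialization action (hypothetical name: install.sh).
# Assumed usage, with placeholder bucket and file names:
#   gsutil cp install.sh gs://<bucket>/install.sh
#   gcloud dataproc clusters create my-cluster \
#     --image-version 1.4 \
#     --metadata 'PACKAGE_URI=gs://<bucket>/<package file>' \
#     --initialization-actions gs://<bucket>/install.sh
set -euo pipefail

# Build the local destination path for a gs:// URI, keeping the file name.
local_path_for() {
  local uri="$1" dir="$2"
  echo "${dir}/$(basename "${uri}")"
}

# Step 1: copy the package from GCS to local disk.
# Step 2: pip install the local copy (pip is not given the gs:// path).
install_from_gcs() {
  local uri="$1"
  local dir path
  dir="$(mktemp -d)"
  path="$(local_path_for "${uri}" "${dir}")"
  gsutil cp "${uri}" "${path}"
  pip install "${path}"
}

# PACKAGE_URI is an assumed metadata key, read with the helper that the
# public Dataproc initialization actions use. If it is unset, do nothing.
PACKAGE_URI="$(/usr/share/google/get_metadata_value attributes/PACKAGE_URI 2>/dev/null || true)"
if [[ -n "${PACKAGE_URI}" ]]; then
  install_from_gcs "${PACKAGE_URI}"
fi
```

Note that pip validates wheel file names, so the wheel should keep the full name it was built or downloaded with (name, version, and platform tags) rather than a shortened name such as psycopg2.whl.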

Henry Gong
  • OK, so you mean I should create a script, say install.sh, that does something like pip install gs:///psycopg2.tar.gz, and then call it in the init action with gcloud dataproc clusters create my-cluster --image-version 1.4 --initialization-actions gs://\install.sh? – Bajwa May 22 '20 at 03:43
  • Right, but you might have to download gs:///psycopg2.tar.gz to a local path on the VM before installing the package. I haven't seen pip support a gs:// path for install. – Henry Gong May 26 '20 at 17:51