I created a cluster in Google Dataproc with the following command:

gcloud beta dataproc clusters create my-cluster \
--project my-project \
--bucket my-bucket \
--region my-region \
--zone my-zone \
--num-workers 5 \
--service-account my-service-account \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh,gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh \
--metadata gcs-connector-version=2.0.0 \
--metadata bigquery-connector-version=1.0.0 \
--scopes cloud-platform \
--optional-components=ANACONDA,JUPYTER,ZEPPELIN,PRESTO \
--metadata PIP_PACKAGES=numpy:scipy:pandas:scikit-learn:matplotlib:seaborn \
--image-version=1.4 \
--properties=^#^spark:spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest.jar'#spark:spark.jars.packages='org.apache.spark:spark-avro_2.12:2.4.4'#zeppelin:zeppelin.notebook.gcs.dir="gs://${BUCKET}/notebooks/zeppelin/${CLUSTER_NAME}"#dataproc:jupyter.notebook.gcs.dir="gs://${BUCKET}/notebooks/jupyter/${CLUSTER_NAME}"

I tried two methods for pip:

1) Adding gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh to --initialization-actions, following this link: Dataproc python configuration. This caused cluster creation to fail.

The error message is: Initialization action failed. Failed action 'gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh', see output in: gs://my-bucket/google-cloud-dataproc-metainfo/df1234gs-3423-647e-bdf4-dfas1231das/my-cluster-m/dataproc-initialization-script-2_output. I can share the file if needed.

2) Using the command above to create the cluster. Cluster creation succeeded. Strangely, when I log in to the master node and try pip, the pip command is available. However, if I run a pip command in Jupyter, for example pip install --upgrade pip, the command completes, but the kernel then dies and restarts repeatedly, which makes Jupyter unusable.

I used to be able to create a cluster that ran smoothly with image 1.4.4, gcs_connector 1.9.16, and bq_connector 0.13.16. I'm not sure whether anything has changed under the hood. Any suggestions are appreciated.

user2830451

1 Answer

I changed --metadata PIP_PACKAGES=numpy:scipy:pandas:scikit-learn:matplotlib:seaborn to --metadata PIP_PACKAGES="numpy scipy pandas scikit-learn matplotlib seaborn" in your command and was able to use the pip-install.sh initialization script.
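
One plausible reason the separator matters, assuming pip-install.sh hands the PIP_PACKAGES value to pip through ordinary shell word splitting (an assumption about the script's internals, not something confirmed in this thread): a colon-separated value would arrive as a single bogus package name, while a space-separated value would arrive as six names. The sketch below only demonstrates that splitting behaviour in bash; `count_args` is a hypothetical helper, not part of any Dataproc script.

```shell
#!/bin/bash
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

colon_list="numpy:scipy:pandas:scikit-learn:matplotlib:seaborn"
space_list="numpy scipy pandas scikit-learn matplotlib seaborn"

# Deliberately unquoted so that bash applies default word splitting
# (on spaces, tabs, and newlines, but not on colons).
colon_count=$(count_args $colon_list)
space_count=$(count_args $space_list)

echo "colon-separated value -> ${colon_count} argument(s)"
echo "space-separated value -> ${space_count} argument(s)"
```

Because the corrected value contains spaces, it has to stay quoted in the gcloud command, as in --metadata PIP_PACKAGES="numpy scipy pandas scikit-learn matplotlib seaborn", so that the local shell passes it as one flag argument.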

cyxxy
  • Thank you for the suggestion. I was able to use `pip-install.sh` as an initialization action. However, I found that pip may have some compatibility issues: when I run `pip install --upgrade pip` in Jupyter, Jupyter keeps failing and becomes unusable. I edited the OP to describe this issue. – user2830451 Jan 30 '20 at 00:59
  • Try running `pip install --upgrade pip` from an init action instead of from inside a Jupyter notebook. – Igor Dvorzhak Jan 30 '20 at 23:07
  • @IgorDvorzhak Could you please elaborate on how to upgrade pip using an init action? I am facing a similar issue: [link](https://stackoverflow.com/questions/70743642/dataproc-cluster-creation-is-failing-with-error-failed-to-build-libcst) – codninja0908 Jan 18 '22 at 08:01
  • @Neha0908 Did you try to create an [init action](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions) that just runs the `pip install --upgrade pip` command? – Igor Dvorzhak Jan 18 '22 at 15:37
  • @IgorDvorzhak I tried running a custom init script (containing the pip upgrade command) with the gcloud dataproc clusters create command, like this: `--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://my-bucket/upgrade-pip.sh' \`. The upgrade-pip.sh file contains `#!/bin/bash` followed by `pip install --upgrade pip`. It still did not resolve the issue. Please let me know whether this is what you suggested. – codninja0908 Jan 19 '22 at 16:59
  • @user2830451 Were you able to fix the pip issue using the gcloud cluster creation script? – codninja0908 Jan 19 '22 at 17:20
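
The init action discussed in these comments could be packaged roughly as follows. This is a minimal sketch, not a confirmed fix: the file name upgrade-pip.sh and the bucket my-bucket are the thread's own placeholders, the upload and cluster-creation steps are left as comments because they need a real project, and one commenter reports that this approach did not resolve their own build failure.

```shell
#!/bin/bash
set -euo pipefail

# Write the init action to a temporary directory on the local machine.
script_path="$(mktemp -d)/upgrade-pip.sh"
cat > "${script_path}" <<'EOF'
#!/bin/bash
# Runs on each cluster node during creation, before Jupyter is used.
set -euxo pipefail
pip install --upgrade pip
EOF

# Check the generated script for syntax errors without executing it.
bash -n "${script_path}"
echo "syntax ok: ${script_path}"

# Next steps, not run here (my-bucket is a placeholder):
#   gsutil cp "${script_path}" gs://my-bucket/upgrade-pip.sh
#   then add gs://my-bucket/upgrade-pip.sh to the comma-separated
#   --initialization-actions list of the cluster creation command.
```

Adding `set -euxo pipefail` inside the init action is a design choice rather than a requirement: it makes a failed upgrade abort cluster creation and leaves a command trace in the initialization output file, which is easier to diagnose than a Jupyter kernel that restarts repeatedly.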