I created a cluster in Google Dataproc with the following command:
gcloud beta dataproc clusters create my-cluster \
--project my-project \
--bucket my-bucket \
--region my-region \
--zone my-zone \
--num-workers 5 \
--service-account my-service-account \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh,gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh \
--metadata gcs-connector-version=2.0.0 \
--metadata bigquery-connector-version=1.0.0 \
--scopes cloud-platform \
--optional-components=ANACONDA,JUPYTER,ZEPPELIN,PRESTO \
--metadata PIP_PACKAGES=numpy:scipy:pandas:scikit-learn:matplotlib:seaborn \
--image-version=1.4 \
--properties=^#^spark:spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest.jar'#spark:spark.jars.packages='org.apache.spark:spark-avro_2.12:2.4.4'#zeppelin:zeppelin.notebook.gcs.dir="gs://${BUCKET}/notebooks/zeppelin/${CLUSTER_NAME}"#dataproc:jupyter.notebook.gcs.dir="gs://${BUCKET}/notebooks/jupyter/${CLUSTER_NAME}"
I tried two methods to get pip working:
1) Adding gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh to --initialization-actions, following this link: Dataproc python configuration. With this, cluster creation failed.
The error message is: Initialization action failed. Failed action 'gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh', see output in: gs://my-bucket/google-cloud-dataproc-metainfo/df1234gs-3423-647e-bdf4-dfas1231das/my-cluster-m/dataproc-initialization-script-2_output
I can share the output file if needed.
2) Using the command above as shown, without pip-install.sh. The cluster was created successfully. Strangely, when I log in to the master node and try pip, the pip command is available. However, if I run a pip command in Jupyter, for example pip install --upgrade pip, the command completes, but the kernel then dies and restarts repeatedly, which makes Jupyter unusable.
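For the failure in method 1, the output file named in the error message can be read with gsutil. This is only a sketch: it assumes the Google Cloud SDK is installed and the account has read access to the bucket. The URI is copied from the error message above, and the 50-line tail is an arbitrary choice.

```shell
# Output file path copied verbatim from the error message.
OUTPUT_URI="gs://my-bucket/google-cloud-dataproc-metainfo/df1234gs-3423-647e-bdf4-dfas1231das/my-cluster-m/dataproc-initialization-script-2_output"

if command -v gsutil >/dev/null 2>&1; then
  # Print the last lines of the initialization action log (50 is an arbitrary choice).
  gsutil cat "${OUTPUT_URI}" | tail -n 50
else
  echo "gsutil not found; install the Google Cloud SDK first" >&2
fi
```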
I used to be able to create a cluster that ran smoothly with image 1.4.4, gcs_connector 1.9.16, and bq_connector 0.13.16. I'm not sure whether anything has changed under the hood. Any suggestions are appreciated.