
We used to spin up a Dataproc cluster with the configuration below. It ran fine until last week, but it now fails with this error:

ERROR: Failed cleaning build dir for libcst
Failed to build libcst
ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly

Building wheels for collected packages: pynacl, libcst
  Building wheel for pynacl (PEP 517): started
  Building wheel for pynacl (PEP 517): still running...
  Building wheel for pynacl (PEP 517): finished with status 'done'
  Created wheel for pynacl: filename=PyNaCl-1.5.0-cp37-cp37m-linux_x86_64.whl size=201317 sha256=4e5897bc415a327f6b389b864940a8c1dde9448017a2ce4991517b30996acb71
  Stored in directory: /root/.cache/pip/wheels/2f/01/7f/11d382bf954a093a55ed9581fd66c3b45b98769f292367b4d3
  Building wheel for libcst (PEP 517): started
  Building wheel for libcst (PEP 517): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/anaconda/bin/python /opt/conda/anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpon3bonqi
       cwd: /tmp/pip-install-9ozf4fcp/libcst

Cluster configuration command:

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>

Things I tried:

a) I installed the wheel package explicitly as part of the pip packages, but the issue was not resolved.

b) Gcloud Command with upgrade pip script:

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://<bucket-path>/upgrade-pip.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>
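
For reference, a minimal sketch of what the upgrade-pip.sh init script could contain, assuming the Anaconda optional component selected above (the file must be saved with Unix line endings; a carriage return after the shebang breaks the interpreter lookup):

```
#!/bin/bash
# Upgrade pip in the cluster's selected Conda environment.
# /opt/conda/default is a symlink to /opt/conda/anaconda here,
# because the cluster is created with --optional-components ANACONDA.
set -euxo pipefail
/opt/conda/default/bin/pip install --upgrade pip
```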
  • Found a similar issue here https://stackoverflow.com/questions/61365790/error-could-not-build-wheels-for-scipy-which-use-pep-517-and-cannot-be-installe. Try upgrading pip? – Dagang Jan 18 '22 at 01:58
  • @Dagang Thanks for the suggestion. I have gone through that link, but I am still trying to understand how to upgrade pip using an init script during cluster creation. Could you please explain how it can be done? I do not have SSH access to the nodes to do anything manually. – codninja0908 Jan 18 '22 at 08:57
  • Use `/opt/conda/default/bin/pip install --upgrade pip` in the init action. – Dagang Jan 18 '22 at 17:49
  • @Dagang I tried to run a custom init script (containing the pip upgrade command) in the above cluster configuration command this way: ```--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://my-bucket/upgrade-pip.sh' \```. The upgrade-pip.sh file contains: ```!/bin/bash pip install --upgrade pip```. So you are suggesting to use ```/opt/conda/default/bin/pip install --upgrade pip``` instead of ```pip install --upgrade pip```? – codninja0908 Jan 19 '22 at 17:22
  • @Dagang So you are suggesting to use ```/opt/conda/default/bin/pip install --upgrade pip``` instead of ```pip install --upgrade pip``` because I am creating a custom image with ```Anaconda``` as my Python interpreter? – codninja0908 Jan 19 '22 at 17:37
  • At cluster creation time (including init actions), `/opt/conda/default` is a symbolic link to either `/opt/conda/miniconda3` or `/opt/conda/anaconda`, depending on which Conda env you select; in your case it is Anaconda. At custom image creation time, you want to use `/opt/conda/anaconda/bin/pip` to make sure it is the Anaconda pip; a bare `pip` doesn't necessarily point at the Anaconda pip. – Dagang Jan 19 '22 at 18:02
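
To illustrate the symlink layout described in the last comment, a small diagnostic sketch that could be dropped into an init action (its output lands in the init action log, so no SSH access is needed; the exact target depends on the selected component):

```
#!/bin/bash
# Print which Conda env /opt/conda/default resolves to,
# and which pip each invocation would actually use.
readlink -f /opt/conda/default        # e.g. /opt/conda/anaconda
/opt/conda/default/bin/pip --version  # pip of the selected Conda env
command -v pip                        # a bare `pip` may be a different one
```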

1 Answer


It seems you need to upgrade pip; see the question linked in the comments above.

But there can be multiple pips in a Dataproc cluster, so you need to choose the right one.

  1. For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose; the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.

  2. For custom images, at image creation time, you want to use the explicit full path, /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.

So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
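
As a sketch, the two cases look like this:

```
# Init action (runs at cluster creation time): either path works,
# since /opt/conda/default points at the selected env (Anaconda here).
/opt/conda/default/bin/pip install --upgrade pip

# Custom image script (runs at image creation time): use the explicit path.
/opt/conda/anaconda/bin/pip install --upgrade pip    # Anaconda
/opt/conda/miniconda3/bin/pip install --upgrade pip  # Miniconda3
```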

  • While running the custom init script to upgrade pip, I am getting this issue: ```tmp/dataproc-agent2708269076228529863/initialize-env.sh: /etc/google-dataproc/startup-scripts/dataproc-initialization-script-1: /usr/bin/bash: bad interpreter: No such file or directory /tmp/dataproc-agent2708269076228529863/initialize-env.sh: line 5: /etc/google-dataproc/startup-scripts/dataproc-initialization-script-1: Success```. What are these scripts? I also cannot access the cluster using SSH (disabled). The command used is under the heading "Gcloud Command with upgrade pip script" above. – codninja0908 Jan 24 '22 at 17:29
  • Followed this [link](https://stackoverflow.com/questions/38924738/google-dataproc-initialization-script-error-file-not-found?rq=1). I did not understand the placement of the initialization file at the specified path; what does it do exactly? – codninja0908 Jan 24 '22 at 17:51
  • I don't know why initialize-env.sh was involved. Could you give more details about your custom init script, and the command you used to create the custom image? – Dagang Jan 24 '22 at 17:59
  • Upgrading pip using a custom init script is resolved; the above issue was due to a carriage return character. – codninja0908 Jan 24 '22 at 18:14
  • Not able to add the gcloud command for the custom image here in the comments due to the character limit, though I have kept the entire command above under the heading ```Gcloud Command with upgrade pip script```. Using that command I am still not able to spin up the cluster. Error message below. – codninja0908 Jan 24 '22 at 18:16
  • Good to know it is resolved. Any other issues? – Dagang Jan 24 '22 at 18:19
  • I am still not able to spin up the cluster. Please see the gcloud command above under the heading ```Gcloud Command with upgrade pip script```. Error message: ```Attempting uninstall: pyyaml Found existing installation: PyYAML 5.1.2 ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall. + sleep 5 + (( i++ )) + (( i < 10 )) + err 'Failed to run command: pip install --upgrade oyaml google-cloud-bigquery apache-airflow-providers-google apache-airflow argparse``` – codninja0908 Jan 24 '22 at 18:24
  • Did you try installing the packages at custom image creation time instead of at cluster creation time? – Dagang Jan 24 '22 at 18:26
  • Could you please elaborate on the difference between "custom image creation time" and "cluster creation time"? We are using Dataproc image 1.5.53-ubuntu18. Also, please check the command above; is it the right way to spin up the Dataproc cluster along with init scripts? – codninja0908 Jan 24 '22 at 18:30
  • In the command "Gcloud Command with upgrade pip script" above, you used init actions, which run at cluster creation time, to upgrade pip and install pip packages; that is only necessary if you don't do the same when you create the custom image. My recommendation: upgrade pip and install the pip packages with the custom image script when you create the image, then remove those init actions when creating the cluster. – Dagang Jan 24 '22 at 18:53
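
A minimal sketch of that recommendation, assuming the generate_custom_image.py tool from the GoogleCloudDataproc/custom-images repository (the image name and script path below are placeholders). First, a customization script that bakes the packages into the image:

```
#!/bin/bash
# customize.sh -- runs once, at custom image creation time.
set -euxo pipefail
/opt/conda/anaconda/bin/pip install --upgrade pip
/opt/conda/anaconda/bin/pip install wheel datalab xgboost==1.3.3 shap oyaml click \
    apache-airflow apache-airflow-providers-google
```

Then the image is built once and reused, so the pip-upgrade and pip-install init actions can be dropped from the cluster creation command:

```
python generate_custom_image.py \
  --image-name my-custom-image \
  --dataproc-version 1.5.54-ubuntu18 \
  --customization-script customize.sh \
  --zone us-east1-b \
  --gcs-bucket gs://<cloud storage bucket>
```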