I run an initialization script when creating a Dataproc cluster. The script copies a Python wheel from GCS onto the cluster and then installs it. This worked fine a few weeks ago, but today, when I create the cluster with a new version of the wheel, it fails with the error below in the Dataproc logs. The only changes to the wheel are in the Python code, and it is a pure-Python wheel. I am using Dataproc image version 1.5.53-debian10.

pip is already installed.
Traceback (most recent call last):
    sys.exit(main())
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_internal/main.py", line 45, in main
    command = create_command(cmd_name, isolated=("--isolated" in cmd_args))
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_internal/commands/__init__.py", line 96, in create_command
    module = importlib.import_module(module_path)
  File "/opt/conda/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 23, in <module>
    from pip._internal.cli.req_command import RequirementCommand
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_internal/cli/req_command.py", line 20, in <module>
    from pip._internal.network.session import PipSession
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_internal/network/session.py", line 17, in <module>
    from pip._vendor import requests, six, urllib3
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_vendor/requests/__init__.py", line 97, in <module>
    from pip._vendor.urllib3.contrib import pyopenssl
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py", line 46, in <module>
    import OpenSSL.SSL
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import crypto, SSL
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/OpenSSL/crypto.py", line 1553, in <module>
    class X509StoreFlags(object):
  File "/opt/conda/miniconda3/lib/python3.7/site-packages/OpenSSL/crypto.py", line 1573, in X509StoreFlags
    CB_ISSUER_CHECK = _lib.X509_V_FLAG_CB_ISSUER_CHECK
AttributeError: module 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'

This looks like an issue with the Python and pip packages on the Dataproc image. Can anyone suggest how to resolve it? The initialization script I am using is as follows:

#!/bin/bash 
function install_pip() {
  if command -v pip >/dev/null; then
      echo "pip is already installed."
      return 0
  fi   
  if command -v easy_install >/dev/null; then
    echo "Installing pip with easy_install..."
    easy_install pip
    return 0
  fi   
  echo "Installing python-pip..."
  apt update
  apt install python-pip -y
} 
# install pip in the cluster
install_pip 

# Get the GCS location for SDK from cluster metadata attributes
SDK_GCS_LOCATION="$(/usr/share/google/get_metadata_value attributes/sdk-gcs-location)"
SDK_FILE_NAME="$(/usr/share/google/get_metadata_value attributes/sdk-file-name)"
readonly SDK_GCS_LOCATION
readonly SDK_FILE_NAME
readonly SDK_WHEEL="${SDK_GCS_LOCATION}/${SDK_FILE_NAME}"

# Copy wheel file from GCS to cluster (quoted so unusual characters in the path do not split the argument)
gsutil cp "${SDK_WHEEL}" .

# Install wheel in cluster
pip install "${SDK_FILE_NAME}"
Rajnil Guha
  • Can you try running the command `pip3 install pyOpenSSL --upgrade`? For more information you can refer to this [stack link](https://stackoverflow.com/questions/73830524). – Prajna Rai T Feb 03 '23 at 08:14
  • @PrajnaRaiT I tried the same thing from the initialization script itself, but it didn't work and I got the same error. I then tried it a different way and it worked. I will put my solution in an answer. If others have tried something else, please feel free to jump in. – Rajnil Guha Feb 06 '23 at 11:22

1 Answer

I fixed this by adding pyOpenSSL==23.0.0, the latest version of the package at the time, to the dataproc:pip.packages property in the cluster config of the Airflow DAG that creates the cluster, alongside the other packages, as below:

"software_config": {
        "image_version": "1.5.53-debian10",
        "properties": {
            "dataproc:efm.spark.shuffle": "primary-worker",
            "dataproc:efm.mapreduce.shuffle": "hcfs",
            "dataproc:pip.packages": "pyOpenSSL==23.0.0",
        },
    },
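
Since dataproc:pip.packages is a single comma-separated string of name==version pins, adding one more pin to an existing list can be done with a small helper. This is only a sketch: the function name merge_pip_packages and the requests==2.28.2 pin are hypothetical and not part of my actual DAG.

```python
def merge_pip_packages(properties, packages):
    """Return a copy of `properties` with `packages` merged into dataproc:pip.packages.

    `packages` is an iterable of "name==version" pins. A later pin for the
    same package (compared case-insensitively) replaces an earlier one.
    """
    key = "dataproc:pip.packages"
    existing = [s for s in properties.get(key, "").split(",") if s.strip()]

    merged = {}
    for spec in existing + list(packages):
        spec = spec.strip()
        name = spec.split("==", 1)[0].lower()
        merged[name] = spec  # later pins override earlier ones

    result = dict(properties)  # leave the caller's dict untouched
    result[key] = ",".join(merged.values())
    return result


if __name__ == "__main__":
    # Hypothetical starting properties; only the pyOpenSSL pin comes from the answer.
    props = {
        "dataproc:efm.spark.shuffle": "primary-worker",
        "dataproc:pip.packages": "requests==2.28.2",
    }
    print(merge_pip_packages(props, ["pyOpenSSL==23.0.0"])["dataproc:pip.packages"])
```

The returned dict can then be used as the "properties" value inside "software_config".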

This worked because the package was upgraded to the latest version, which is compatible with pip. I had expected Dataproc / Google Cloud Platform to manage these libraries, since Python and pip come pre-installed on Dataproc clusters.
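
To confirm that the pin actually landed on a node, one option is to read the installed distribution metadata rather than importing OpenSSL, since that import is exactly what fails in the traceback. The following is a minimal sketch, assuming it is run with the cluster's conda Python; the helper name installed_version is hypothetical. It also reports cryptography, on the assumption that the missing attribute comes from a mismatch between pyOpenSSL and the cryptography bindings it wraps.

```python
def installed_version(dist_name):
    """Return the installed version string of `dist_name`, or None if it is absent."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None

    # Fallback for Python 3.7, which the 1.5 image's conda environment uses.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    for name in ("pyOpenSSL", "cryptography", "pip"):
        print(name, installed_version(name))
```

If pyOpenSSL still shows the old version after the cluster is created, the property was probably not applied to that node.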

Rajnil Guha