I wanted to install some python packages (eg: python-json-logger) on Serverless Dataproc. Is there a way to do an initialization action to install python packages in serverless dataproc? Please let me know.
-
The official documentation is not helpful: https://cloud.google.com/dataproc/docs/tutorials/python-configuration? – Florian Salihovic Apr 27 '22 at 19:27
-
Yeah, and this documentation is not for serverless dataproc. – Ish14 Apr 27 '22 at 21:27
-
I'm wondering the same thing. I guess one way would be to create a docker image with the deps baked in, but surely there's a better way. – surj Apr 30 '22 at 22:27
-
I guess this is one possible solution to create a custom docker image: https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers?hl=en#example_custom_container_image_build – Ish14 Jun 09 '22 at 20:49
-
This blog post was really useful to me. https://medium.com/cts-technologies/running-pyspark-jobs-on-google-cloud-using-serverless-dataproc-f16cef5ec6b9 – teramonagi Aug 30 '22 at 04:45
1 Answer
You have two options:
- Using the gcloud command in a terminal:
You can create a custom image with your dependencies (Python packages) baked in, push it to GCR (Google Container Registry), and pass its URI as a parameter in the command below (a build-and-push sketch follows the example):
e.g.
$ gcloud beta dataproc batches submit pyspark gs://bucket-name/python-file.py \
    --container-image=gcr.io/my-project-id/my-image:1.0.1 \
    --project=my-project-id --region=us-central1 \
    --jars=file:///usr/lib/spark/external/spark-avro.jar \
    --subnet=projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name
See the custom-containers guide (linked in the comments above) for how to create the custom container image for Dataproc Serverless for Spark.
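For this question, the relevant step in that Dockerfile is adding a layer such as RUN pip install python-json-logger (or the conda equivalent). A rough sketch of building and pushing the image to GCR, assuming the Dockerfile is in the current directory and the tag matches the --container-image flag above:
$ docker build -t gcr.io/my-project-id/my-image:1.0.1 .
$ docker push gcr.io/my-project-id/my-image:1.0.1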
- Using the Airflow operator DataprocCreateBatchOperator:
Add the script below to your Python file; at runtime it installs the desired packages and then loads them from a path inside the container (Dataproc Serverless). This file must be saved in a bucket. The Secret Manager package is used as an example.
python-file.py
import importlib
import sys
from dataclasses import dataclass
from warnings import warn

import pip


def load_package(package, path):
    warn("Update path order. Watch out for importing errors!")
    if path not in sys.path:
        sys.path.insert(0, path)
    module = importlib.import_module(package)
    return importlib.reload(module)


@dataclass
class PackageInfo:
    import_path: str
    pip_id: str


packages = [PackageInfo("google.cloud.secretmanager", "google-cloud-secret-manager==2.4.0")]
path = '/tmp/python_packages'

# Install the packages into a writable path inside the container...
pip.main(['install', '-t', path, *[package.pip_id for package in packages]])

# ...then make them importable for the rest of the job.
for package in packages:
    load_package(package.import_path, path=path)
...
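Once load_package has run, the installed package can be imported like any other module in the rest of the job. A small usage sketch (the project and secret names below are placeholders, not part of the original answer):
from google.cloud import secretmanager

# Resolves because /tmp/python_packages was prepended to sys.path above.
client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(
    name="projects/my-project-id/secrets/my-secret/versions/latest"
)
print(response.payload.data.decode("utf-8"))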
Finally, the operator calls python-file.py:
create_batch = DataprocCreateBatchOperator(
    task_id="batch_create",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://bucket-name/python-file.py",
            "args": ["value1", "value2"],
            "jar_file_uris": ["gs://bucket-name/jar-file.jar"],
        },
        "environment_config": {
            "execution_config": {
                "subnetwork_uri": "projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name"
            },
        },
    },
    batch_id="batch-create",
)
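For context, a minimal sketch of the DAG file this operator would live in (the dag_id, start date and schedule are placeholders; depending on your apache-airflow-providers-google version you may also need to pass region and project_id explicitly, as shown here; the batch dict is abbreviated to the main file only):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="dataproc_serverless_batch",  # placeholder name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        project_id="my-project-id",
        region="us-central1",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://bucket-name/python-file.py",
            },
        },
        batch_id="batch-create",
    )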

-
The first one is what Google recommends. The second suggestion only installs the dependency on the driver. There has to be an easier way to do it rather than having to create a custom image that expires in 60 days. – dvarelas Nov 29 '22 at 15:59
-
@dvarelas when you say that the image expires after 60 days, I believe you mean a regular Dataproc image (which is a compute disk image). For Dataproc Serverless, GCR Docker images are used, which do not expire. – Jedrzej G Dec 08 '22 at 06:56