I wanted to install some python packages (eg: python-json-logger) on Serverless Dataproc. Is there a way to do an initialization action to install python packages in serverless dataproc? Please let me know.
-
The official documentation is not helpful: https://cloud.google.com/dataproc/docs/tutorials/python-configuration? – Florian Salihovic Apr 27 '22 at 19:27
-
Yeah, and this documentation is not for serverless dataproc. – Ish14 Apr 27 '22 at 21:27
-
I'm wondering the same thing. I guess one way would be to create a docker image with the deps baked in, but surely there's a better way. – surj Apr 30 '22 at 22:27
-
I guess this is one possible solution to create a custom docker image: https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers?hl=en#example_custom_container_image_build – Ish14 Jun 09 '22 at 20:49
-
This blog post was really useful to me. https://medium.com/cts-technologies/running-pyspark-jobs-on-google-cloud-using-serverless-dataproc-f16cef5ec6b9 – teramonagi Aug 30 '22 at 04:45
1 Answer
You have two options:
- Using the gcloud command in a terminal:
You can create a custom image with your dependencies (Python packages) baked in, push it to GCR (Google Container Registry), and pass its URI as a parameter in the command below (a build-and-push sketch follows the example):
e.g.
$ gcloud beta dataproc batches submit pyspark gs://bucket-name/python-file.py \
    --container-image=gcr.io/my-project-id/my-image:1.0.1 \
    --project=my-project-id --region=us-central1 \
    --jars=file:///usr/lib/spark/external/spark-avro.jar \
    --subnet=projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name
See the custom-containers guide (linked in the comments above) for how to create the custom container image for Dataproc Serverless for Spark.
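For this question, the relevant step in that Dockerfile is adding a layer such as RUN pip install python-json-logger (or the conda equivalent). A rough sketch of building and pushing the image to GCR, assuming the Dockerfile is in the current directory and the tag matches the --container-image flag above:
$ docker build -t gcr.io/my-project-id/my-image:1.0.1 .
$ docker push gcr.io/my-project-id/my-image:1.0.1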
- Using the Airflow operator DataprocCreateBatchOperator:
Add the script below to your Python file; at runtime it installs the desired packages and then loads them from a path inside the container (Dataproc Serverless). This file must be saved in a bucket. The Secret Manager package is used as an example.
python-file.py
import importlib
import sys
from dataclasses import dataclass
from warnings import warn

import pip


def load_package(package, path):
    warn("Update path order. Watch out for importing errors!")
    if path not in sys.path:
        sys.path.insert(0, path)
    module = importlib.import_module(package)
    return importlib.reload(module)


@dataclass
class PackageInfo:
    import_path: str
    pip_id: str


packages = [PackageInfo("google.cloud.secretmanager", "google-cloud-secret-manager==2.4.0")]
path = '/tmp/python_packages'

# Install the packages into a writable path inside the container...
pip.main(['install', '-t', path, *[package.pip_id for package in packages]])

# ...then make them importable for the rest of the job.
for package in packages:
    load_package(package.import_path, path=path)
...
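Once load_package has run, the installed package can be imported like any other module in the rest of the job. A small usage sketch (the project and secret names below are placeholders, not part of the original answer):
from google.cloud import secretmanager

# Resolves because /tmp/python_packages was prepended to sys.path above.
client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(
    name="projects/my-project-id/secrets/my-secret/versions/latest"
)
print(response.payload.data.decode("utf-8"))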
Finally, the operator calls python-file.py:
create_batch = DataprocCreateBatchOperator(
    task_id="batch_create",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://bucket-name/python-file.py",
            "args": ["value1", "value2"],
            "jar_file_uris": ["gs://bucket-name/jar-file.jar"],
        },
        "environment_config": {
            "execution_config": {
                "subnetwork_uri": "projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name"
            },
        },
    },
    batch_id="batch-create",
)
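For context, a minimal sketch of the DAG file this operator would live in (the dag_id, start date and schedule are placeholders; depending on your apache-airflow-providers-google version you may also need to pass region and project_id explicitly, as shown here; the batch dict is abbreviated to the main file only):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="dataproc_serverless_batch",  # placeholder name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        project_id="my-project-id",
        region="us-central1",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://bucket-name/python-file.py",
            },
        },
        batch_id="batch-create",
    )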

-
The first one is what Google recommends. The second suggestion only installs the dependency on the driver. There has to be an easier way to do it rather than having to create a custom image that expires in 60 days. – dvarelas Nov 29 '22 at 15:59
-
@dvarelas when you say that the image expires after 60 days, I believe you mean a regular Dataproc image (which is a compute disk image). For Dataproc Serverless, GCR Docker images are used, which do not expire. – Jedrzej G Dec 08 '22 at 06:56