Questions tagged [google-cloud-dataproc-serverless]
25 questions
6
votes
1 answer
Installing python packages in Serverless Dataproc GCP
I wanted to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to run an initialization action to install Python packages in serverless Dataproc? Please let me know.

Ish14
- 69
- 2
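Dataproc Serverless does not support initialization actions, so one commonly suggested route (an assumption here, not taken from the question) is to bake the packages into a custom container image and pass it at submit time. Every name below (bucket, project, image path) is a placeholder:

```shell
# Sketch only: placeholder project, bucket, and image names.
# There are no init actions on Serverless; dependencies come in
# via a custom container image supplied at submission.
gcloud dataproc batches submit pyspark gs://my-bucket/main.py \
    --project=my-project \
    --region=us-central1 \
    --container-image=us-central1-docker.pkg.dev/my-project/my-repo/pyspark-deps:1.0
```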
5
votes
1 answer
Custom Container Image for Google Dataproc pyspark Batch Job
I am exploring the newly introduced Google Dataproc Serverless. While submitting a job, I want to use a custom image (via the --container-image argument) so that all my Python libraries and related files are already present on the server and the job…

konkodi
- 145
- 3
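A minimal Dockerfile sketch for such an image, assuming the documented model in which Dataproc Serverless mounts Spark into the container at runtime; the base image, package list, and the dedicated spark user with UID/GID 1099 follow the shape of the official example and are assumptions, not a verified build:

```dockerfile
# Sketch only: Spark itself is mounted into the container at runtime,
# so the image just needs an OS, a Python, and your dependencies.
FROM python:3.11-slim

# Bake in the Python libraries the jobs need (placeholder list).
RUN pip install --no-cache-dir python-json-logger

# The documented examples run as a dedicated "spark" user (UID/GID 1099).
RUN groupadd -g 1099 spark \
    && useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
```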
4
votes
1 answer
Google Cloud Dataproc Serverless (batch) PySpark reads parquet file from Google Cloud Storage (GCS) very slowly
I have an inverse frequency parquet file of the wiki corpus on Google Cloud Storage (GCS). I want to load it from GCS into Dataproc Serverless (batch). However, loading the parquet with pyspark.read on a Dataproc batch takes much longer than my…

Sam
- 83
- 1
- 1
- 4
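Not an answer from the thread, but a common first experiment for slow reads is to request more (and larger) initial executors through standard Spark properties at submit time. The values below are purely illustrative and are subject to the resource ranges Dataproc Serverless allows:

```shell
# Illustrative only: larger initial executor count/size for the batch.
gcloud dataproc batches submit pyspark gs://my-bucket/load_parquet.py \
    --region=us-central1 \
    --properties=spark.executor.instances=10,spark.executor.cores=8,spark.executor.memory=16g
```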
4
votes
1 answer
How to force delete dataproc serverless batch
I am running a pyspark Dataproc Serverless batch. It has been running for too long, so I decided to delete it. But neither the GCP console nor the CLI allows me to delete the batch.
The command I tried is
gcloud dataproc batches delete …

Afaq
- 1,146
- 1
- 13
- 25
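A batch that is still running generally has to be cancelled before it can be deleted; a sketch with a placeholder batch name and region:

```shell
# Cancel the running workload first, then delete the terminated batch.
gcloud dataproc batches cancel my-batch --region=us-central1
gcloud dataproc batches delete my-batch --region=us-central1
```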
4
votes
1 answer
Custom Image Pulled Every Time in Google Dataproc Serverless
I am using a custom image in Dataproc Serverless. When I execute a job, it pulls the image every time, which adds about 1 minute of extra processing time. We will be executing 1000-plus jobs in production, so this becomes a significant performance bottleneck.
Is…

konkodi
- 145
- 3
3
votes
2 answers
Programmatically cancelling a pyspark dataproc batch job
Using golang, I have several Dataproc batch jobs running, and I can access them via their UUID by creating a client like this.
BatchClient, err := dataproc.NewBatchControllerClient(context, ...options)
If I wanted to delete a batch job, I could do…

David Gamboa
- 116
- 1
- 9
3
votes
1 answer
compute.requireOsLogin violated in dataproc serverless
I am trying to create a batch in dataproc to run my job. After creating the batch it is failing with the error compute.requireOsLogin violated for project ...
In my organization policy, this constraint (compute.requireOsLogin) is enforced. Any way to…

Help_me_a_bit
- 103
- 5
3
votes
1 answer
Dataproc Serverless - how to set javax.net.ssl.trustStore property to fix java.security.cert.CertPathValidatorException
Trying to use google-cloud-dataproc-serverless with the spark.jars.repositories option
gcloud beta dataproc batches submit pyspark sample.py --project=$GCP_PROJECT --region=$MY_REGION --properties…

Ranga Vure
- 1,922
- 3
- 16
- 23
2
votes
0 answers
Is there any way to get the error code and error message directly from Dataproc API
We are currently creating Dataproc clusters using the sample code below:
from google.cloud import dataproc_v1

def sample_create_cluster():
    # Create a client
    client = dataproc_v1.ClusterControllerClient()
    # Initialize request argument(s)
    …

ash_ketchum12
- 73
- 6
2
votes
1 answer
How to rename GCS files in Spark running on Dataproc Serverless?
After writing a Spark dataframe to a file, I am attempting to rename the file using code like the following:
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val file = fs.globStatus(new Path(path +…

Daniel Fletemier
- 31
- 3
2
votes
1 answer
Serverless Dataproc Error- Batch ID is required
While trying to submit a Spark job to Serverless Dataproc using the REST API
https://cloud.google.com/dataproc-serverless/docs/quickstarts/spark-batch#dataproc_serverless_create_batch_workload-drest
curl -X POST \
-H "Authorization: Bearer "$(gcloud…

Ranga Vure
- 1,922
- 3
- 16
- 23
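One frequent cause of this error is dropping the batchId query parameter that the quickstart's URL carries on the batches.create call. A stdlib-only sketch of building that URL (project, region, and batch ID below are placeholders):

```python
from urllib.parse import urlencode

def batches_create_url(project: str, region: str, batch_id: str) -> str:
    """Build the batches.create endpoint with the batchId query parameter."""
    base = (f"https://dataproc.googleapis.com/v1/projects/{project}"
            f"/locations/{region}/batches")
    return f"{base}?{urlencode({'batchId': batch_id})}"

print(batches_create_url("my-project", "us-central1", "my-batch-001"))
# → https://dataproc.googleapis.com/v1/projects/my-project/locations/us-central1/batches?batchId=my-batch-001
```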
1
vote
1 answer
How to properly kill a running batch dataproc job?
I ran a long-running batch job on Dataproc Serverless. After a while, I realized that running the job any longer was a waste of time and money, and I wanted to stop it.
I couldn't find a way to kill the job. However, there were…

Aman Ranjan Verma
- 183
- 1
- 10
1
vote
0 answers
Use Google Cloud Workflows to trigger Dataproc Batch job
My scenario demands orchestration, since the jobs in a flow (say, a DAG) are connected/codependent. Cloud Composer is too expensive since we only have a few jobs to run (it's not worth it).
I've been looking around and looks like Google Cloud…

no-stale-reads
- 177
- 1
- 1
- 11
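Workflows can call Dataproc through its connector, assuming the connector exposes the batches.create method; everything below (project, region, bucket, batch ID) is a placeholder sketch, not a verified workflow:

```yaml
# Sketch of a Workflows step creating a Dataproc Serverless batch
# via the Dataproc connector (all identifiers are placeholders).
main:
  steps:
    - create_batch:
        call: googleapis.dataproc.v1.projects.locations.batches.create
        args:
          parent: projects/my-project/locations/us-central1
          batchId: my-batch-001
          body:
            pysparkBatch:
              mainPythonFileUri: gs://my-bucket/main.py
        result: batch
```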
1
vote
1 answer
Pyspark with custom container on GCP Dataproc Serverless : access to class in custom container image
I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception:
Exception in thread "main" org.apache.spark.SparkException: Failed to get…

Sophie192
- 13
- 3
1
vote
1 answer
ModuleNotFoundError: No module named 'elasticsearch' in Dataproc Serverless Pyspark job
I am trying to use the elasticsearch package in a Dataproc Serverless PySpark job. I am facing this issue only with this package in Dataproc Serverless.
import os
print("Current dir:", os.getcwd())
print("Current dir list:", os.listdir('.'))
import…

ash_ketchum12
- 73
- 6