Questions tagged [google-cloud-dataproc-serverless]

25 questions
6
votes
1 answer

Installing python packages in Serverless Dataproc GCP

I want to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to run an initialization action to install Python packages in Serverless Dataproc? Please let me know.
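Dataproc Serverless batches do not support initialization actions, so dependencies are usually attached to the batch itself or baked into a custom container image (covered in the next question). Below is a minimal, hedged sketch using the google-cloud-dataproc Python client; the project, region, bucket paths, batch ID, and deps.zip archive are all hypothetical, and it assumes the packages are pure Python so they can be shipped as a zip on the PYTHONPATH.

```python
# Hedged sketch: ship pure-Python dependencies with a Dataproc Serverless
# batch instead of an init action. All names and paths are hypothetical.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/job.py"
# deps.zip would contain e.g. pythonjsonlogger/ at its top level; Spark adds
# python_file_uris entries to the PYTHONPATH of driver and executors.
batch.pyspark_batch.python_file_uris = ["gs://my-bucket/deps.zip"]

operation = client.create_batch(
    request=dataproc_v1.CreateBatchRequest(
        parent=f"projects/{project}/locations/{region}",
        batch=batch,
        batch_id="deps-zip-example",
    )
)
print(operation.result().state)
```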
5
votes
1 answer

Custom Container Image for Google Dataproc pyspark Batch Job

I am exploring the newly introduced Google Dataproc Serverless. While submitting a job, I want to use a custom image (via the --container-image argument) so that all my Python libraries and related files are already present on the server and the job…
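For reference, the --container-image flag corresponds to the batch's runtime_config.container_image field. A hedged sketch of the same create_batch call as above, with the dependencies baked into a (hypothetical) image pushed to Artifact Registry instead of a deps zip:

```python
# Hedged sketch: run a Dataproc Serverless PySpark batch on a custom image
# that already contains the needed Python libraries. Names are hypothetical.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/job.py"
# Equivalent of `gcloud dataproc batches submit pyspark --container-image=...`
batch.runtime_config.container_image = (
    "us-central1-docker.pkg.dev/my-project/my-repo/pyspark-custom:1.0"
)

client.create_batch(
    request=dataproc_v1.CreateBatchRequest(
        parent=f"projects/{project}/locations/{region}",
        batch=batch,
        batch_id="custom-image-example",
    )
)
```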
4
votes
1 answer

Google Cloud Dataproc Serverless (batch) pyspark reads parquet file from Google Cloud Storage (GCS) very slowly

I have an inverse-frequency parquet file of the wiki corpus on Google Cloud Storage (GCS). I want to load it from GCS into Dataproc Serverless (batch). However, loading the parquet with pyspark.read on Dataproc batch is much slower than my…
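No answer is implied here, but when a Serverless batch reads noticeably slower than a local run, one hedged first check is how much parallelism the batch actually got. The sketch below (hypothetical bucket path) times the load and prints the executor-related settings in effect:

```python
# Hedged sketch: time a parquet load from GCS and report the executor
# settings the batch ran with. The GCS path is hypothetical.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-read-timing").getOrCreate()

start = time.time()
df = spark.read.parquet("gs://my-bucket/wiki-inverse-frequency.parquet")
rows = df.count()  # count() forces the read; spark.read alone is lazy
print(f"rows={rows} seconds={time.time() - start:.1f}")

for key in ("spark.executor.instances",
            "spark.executor.cores",
            "spark.dynamicAllocation.maxExecutors"):
    print(key, "=", spark.conf.get(key, "unset"))
```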
4
votes
1 answer

How to force delete dataproc serverless batch

I am running a pyspark Dataproc Serverless batch. It has been running for too long, so I decided to delete it. But neither the GCP console nor the CLI allows me to delete the batch. The command I tried is gcloud dataproc batches delete
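A batch that is still RUNNING generally cannot be deleted outright; the usual sequence (hedged, with a hypothetical project, region, and batch ID) is to cancel the workload first and delete the resource once it reaches a terminal state. One way to cancel from code is via the long-running operation recorded on the batch, as sketched below; if your gcloud version has it, `gcloud dataproc batches cancel` covers that first step from the CLI.

```python
# Hedged sketch: stop a long-running Dataproc Serverless batch, then delete
# the batch resource. Project, region, and batch ID are hypothetical.
import time

from google.cloud import dataproc_v1

project, region, batch_id = "my-project", "us-central1", "my-stuck-batch"
name = f"projects/{project}/locations/{region}/batches/{batch_id}"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = client.get_batch(name=name)
print("current state:", batch.state.name)

# batch.operation holds the name of the long-running operation that created
# the batch; cancelling it asks Dataproc to stop the running workload.
client.transport.operations_client.cancel_operation(batch.operation)

# delete_batch is rejected while the batch is still RUNNING, so wait for a
# terminal state before removing the resource.
terminal = {
    dataproc_v1.Batch.State.CANCELLED,
    dataproc_v1.Batch.State.SUCCEEDED,
    dataproc_v1.Batch.State.FAILED,
}
while client.get_batch(name=name).state not in terminal:
    time.sleep(15)

client.delete_batch(name=name)
```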
4
votes
1 answer

Custom Image Pulled Every Time in Google Dataproc Serverless

I am using a custom image in Dataproc Serverless. When I execute a job, it pulls the image every time, which adds about 1 minute of extra processing time. We will be executing 1000-plus jobs in production, so this becomes a significant performance bottleneck. Is…
3
votes
2 answers

Programmatically cancelling a pyspark dataproc batch job

Using Go, I have several Dataproc batch jobs running, and I can access them via their UUID by creating a client like this: BatchClient, err := dataproc.NewBatchControllerClient(context, ...options). If I wanted to delete a batch job, I could do…
3
votes
1 answer

compute.requireOsLogin violated in dataproc serverless

I am trying to create a batch in Dataproc to run my job. After creating the batch, it fails with the error compute.requireOsLogin violated for project ... In my organization, the compute.requireOsLogin policy is enforced. Any way to…
3
votes
1 answer

Dataproc Serverless - how to set javax.net.ssl.trustStore property to fix java.security.cert.CertPathValidatorException

Trying to use google-cloud-dataproc-serverless with the spark.jars.repositories option: gcloud beta dataproc batches submit pyspark sample.py --project=$GCP_PROJECT --region=$MY_REGION --properties…
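A hedged sketch of one way to wire such JVM properties into a batch with the Python client; the paths and repository URL are hypothetical, it assumes the trust store already exists on the image (e.g. baked into a custom container), and whether this resolves the CertPathValidatorException depends on which connection is actually failing.

```python
# Hedged sketch: pass spark.jars.repositories plus a custom trust store to
# driver and executor JVMs on a Dataproc Serverless batch. The trust-store
# path assumes the file is already present on the container image.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

jvm_opts = "-Djavax.net.ssl.trustStore=/etc/custom-truststore.jks"

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/sample.py"
batch.runtime_config.properties = {
    "spark.jars.repositories": "https://repo.example.com/maven2",
    "spark.driver.extraJavaOptions": jvm_opts,
    "spark.executor.extraJavaOptions": jvm_opts,
}

client.create_batch(
    request=dataproc_v1.CreateBatchRequest(
        parent=f"projects/{project}/locations/{region}",
        batch=batch,
        batch_id="truststore-example",
    )
)
```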
2
votes
0 answers

Is there any way to get the error code and error message directly from Dataproc API

We are currently creating Dataproc clusters using the sample code below: from google.cloud import dataproc_v1 def sample_create_cluster(): # Create a client client = dataproc_v1.ClusterControllerClient() # Initialize request argument(s) …
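With the Python client, the status code and message are usually available on the exception raised by the call (or by the returned long-running operation). A hedged sketch with a hypothetical, deliberately minimal cluster spec:

```python
# Hedged sketch: read the error code and message when cluster creation fails.
# The project, region, and cluster spec are hypothetical and minimal.
from google.api_core.exceptions import GoogleAPICallError
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

try:
    operation = client.create_cluster(
        project_id=project,
        region=region,
        cluster={"project_id": project, "cluster_name": "example-cluster"},
    )
    operation.result()  # raises if the operation finishes in error
except GoogleAPICallError as err:
    # err.code is the HTTP status, err.message the human-readable detail,
    # err.errors any structured payload the API attached to the failure.
    print("code:", err.code)
    print("message:", err.message)
    print("details:", err.errors)
```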
2
votes
1 answer

How to rename GCS files in Spark running on Dataproc Serverless?

After writing a Spark dataframe to a file, I am attempting to rename the file using code like the following: val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration) val file = fs.globStatus(new Path(path +…
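A hedged PySpark variant of the same idea, which asks the gs:// path for its own FileSystem rather than using the default one; the bucket and file names are hypothetical, and on GCS a rename is a copy-and-delete rather than an atomic move.

```python
# Hedged sketch: rename the part file Spark wrote to GCS using the Hadoop
# FileSystem API via the JVM gateway. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-output").getOrCreate()
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()

out_dir = jvm.org.apache.hadoop.fs.Path("gs://my-bucket/output/")
fs = out_dir.getFileSystem(hadoop_conf)  # GCS FileSystem for gs:// paths

# Locate the single part-* file Spark produced and give it a stable name.
part = fs.globStatus(jvm.org.apache.hadoop.fs.Path("gs://my-bucket/output/part-*"))[0]
fs.rename(part.getPath(),
          jvm.org.apache.hadoop.fs.Path("gs://my-bucket/output/data.parquet"))
```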
2
votes
1 answer

Serverless Dataproc Error - Batch ID is required

While trying to submit a Spark job to Serverless Dataproc using the REST API https://cloud.google.com/dataproc-serverless/docs/quickstarts/spark-batch#dataproc_serverless_create_batch_workload-drest curl -X POST \ -H "Authorization: Bearer "$(gcloud…
1
vote
1 answer

How to properly kill a running batch dataproc job?

I ran a long-running batch job in Dataproc Serverless. After a while, I realized that running the job any longer was a waste of time and money, and I wanted to stop it. I couldn't find a way to kill the job. However, there were…
1
vote
0 answers

Use Google Cloud Workflows to trigger Dataproc Batch job

My scenario demands orchestration since the jobs in a flow (say, a DAG) are connected/codependent. Cloud Composer is too expensive since we only have a few jobs to run (it is not worth it). I've been looking around, and it looks like Google Cloud…
1
vote
1 answer

Pyspark with custom container on GCP Dataproc Serverless: access to class in custom container image

I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception: Exception in thread "main" org.apache.spark.SparkException: Failed to get…
1
vote
1 answer

ModuleNotFoundError: No module named 'elasticsearch' in Dataproc Serverless Pyspark job

I am trying to use the elasticsearch package in a Dataproc Serverless PySpark job. I am facing this issue only with this package in Dataproc Serverless. import os print("Current dir:", os.getcwd()) print("Current dir list:", os.listdir('.')) import…