Questions tagged [dataproc]
130 questions
6 votes · 1 answer
Installing python packages in Serverless Dataproc GCP
I wanted to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to run an initialization action to install Python packages in Serverless Dataproc? Please let me know.
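Dataproc Serverless batches do not support initialization actions; the two documented routes are baking packages into a custom container image or shipping a packed virtual environment. A rough sketch (project, repo, and bucket names are placeholders):

```shell
# (a) Use a custom container image that already has the packages installed:
gcloud dataproc batches submit pyspark job.py \
    --region=us-central1 \
    --container-image=us-docker.pkg.dev/my-project/my-repo/pyspark-deps:latest

# (b) Ship a packed virtualenv (built with venv-pack or conda-pack):
gcloud dataproc batches submit pyspark job.py \
    --region=us-central1 \
    --archives=gs://my-bucket/environment.tar.gz#environment \
    --properties=spark.pyspark.python=environment/bin/python
```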

Ish14 · 69

4 votes · 1 answer
Dataproc: Can user create workers of different instance types?
scenario:
master: x1 machine type
workers: x2 machine type, x3 machine type.
For the above scenario: AWS EMR instance fleets allow users to create different worker instance types. From the Dataproc console, I noticed the option is only for N worker…
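On Dataproc, all primary workers share a single machine type, but a second pool of secondary (preemptible/spot) workers can be attached. A sketch of the gcloud equivalent (names are placeholders; exact flag support depends on your gcloud version):

```shell
# One machine type for primary workers, plus a secondary spot-worker pool:
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-8 \
    --num-workers=2 \
    --secondary-worker-type=spot \
    --num-secondary-workers=4
```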

user2622678 · 393

4 votes · 0 answers
Dataproc Job not giving any output
I have submitted a Spark job through Airflow; sometimes the job works and sometimes it gives no output at all.
Even after 2-3 hrs of waiting, the job gives no detail apart from
Waiting for job output...
I am using dataproc-1-4-deb10.
It's a simple job…
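When a job hangs at "Waiting for job output...", it can help to inspect the job resource itself rather than the Airflow log. A sketch, assuming gcloud is configured for the project (job id is a placeholder):

```shell
# Show the job's state, YARN application id, and driverOutputResourceUri:
gcloud dataproc jobs describe my-job-id --region=us-central1
# Re-attach to the job and stream its driver output:
gcloud dataproc jobs wait my-job-id --region=us-central1
```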

Shubham Asabe · 79

4 votes · 1 answer
Apache Phoenix - GCP Dataproc
I am doing a POC on Google Cloud Dataproc with HBase as one of the components.
I created a cluster and was able to get it running along with the HBase service. I can list and create tables via the shell.
I want to use the Apache Phoenix as…
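Phoenix is not a managed Dataproc component, so it is typically installed via an initialization action or by hand; once installed, connecting usually means pointing sqlline at the cluster's ZooKeeper quorum. A sketch in which the install path and host name are assumptions:

```shell
# Assumes Phoenix was unpacked under /usr/lib/phoenix and that ZooKeeper
# runs on the master node (my-cluster-m) on the default port 2181:
/usr/lib/phoenix/bin/sqlline.py my-cluster-m:2181:/hbase
```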

chandresh_cool · 11,753

4 votes · 1 answer
Is it possible to set a fully customized metric for auto scale-out of Dataproc worker nodes in GCP (Google Cloud Platform)?
Is it possible to set a fully customized metric for auto scale-out of Dataproc worker nodes in GCP (Google Cloud Platform)?
I want to run Spark distributed processing on Dataproc in GCP.
But the thing is, I just want to horizontally scale…
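Dataproc autoscaling is driven by YARN pending/available memory rather than arbitrary user-defined metrics, so a fully custom metric is not directly supported; what can be tuned is the autoscaling policy. A sketch of defining and importing one (names and values are placeholders):

```shell
# Define a policy keyed off YARN memory and attach it when creating clusters:
cat > policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF
gcloud dataproc autoscaling-policies import my-policy \
    --region=us-central1 --source=policy.yaml
```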

jinsu park · 61

3 votes · 0 answers
How can I cancel a dataproc job before it starts running?
I have to wait for a job to start running before I can cancel it. Is there a way to cancel the job early? Why can I not cancel a job in SETUP_DONE?
Cancelling the job errors with FAILED_PRECONDITION: Cannot cancel jobId 'x' in project 'y' in state:…
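Once the job does reach a cancellable state (e.g. PENDING or RUNNING), it can be stopped from the CLI; a minimal sketch with a placeholder job id:

```shell
gcloud dataproc jobs kill my-job-id --region=us-central1
```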

stefanQ · 31

3 votes · 1 answer
How to enable the Spark web interface on Dataproc (GCP) using DataprocCreateClusterOperator of Apache Airflow
We are using Apache Airflow's DataprocCreateClusterOperator to create a Spark cluster on GCP (Dataproc) and want to enable the Spark web UI interfaces. When creating via the terminal we pass --enable-component-gateway in the create cluster command. How can…
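In the Dataproc v1 API, `--enable-component-gateway` corresponds to the `endpoint_config.enable_http_port_access` field of the cluster config, so with the operator it can be set inside the `cluster_config` dict. A sketch; the operator kwargs in the comment are assumptions about your provider version:

```python
# cluster_config dict for DataprocCreateClusterOperator; the endpoint_config
# entry mirrors --enable-component-gateway on the gcloud CLI.
cluster_config = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    # Equivalent of --enable-component-gateway:
    "endpoint_config": {"enable_http_port_access": True},
}

# Hypothetical usage (requires apache-airflow-providers-google):
# DataprocCreateClusterOperator(
#     task_id="create_cluster",
#     project_id="my-project",
#     region="us-central1",
#     cluster_name="my-cluster",
#     cluster_config=cluster_config,
# )
```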

Anoop Deshpande · 514

3 votes · 1 answer
DataprocClusterCreateOperator doesn't have a temp_bucket variable to define
I am trying to create a Dataproc cluster via DataprocClusterCreateOperator in Apache Airflow.
Airflow version: 1.10.15
Composer version: 1.16.4
I want to assign the project's temp bucket to the cluster and not the bucket Google creates during…
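In the Dataproc v1 API, ClusterConfig has top-level `config_bucket` and `temp_bucket` fields, so an operator that accepts a full cluster config dict (the newer DataprocCreateClusterOperator) can set them; the legacy Airflow 1.10 operator exposes no such kwarg. A sketch with placeholder bucket names:

```python
# Full cluster_config dict; config_bucket is the staging bucket and
# temp_bucket holds ephemeral job data (both names are placeholders).
cluster_config = {
    "config_bucket": "my-staging-bucket",
    "temp_bucket": "my-temp-bucket",
    "worker_config": {"num_instances": 2},
}
```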

naval m · 61

3 votes · 1 answer
PySpark runs in YARN client mode but fails in cluster mode for "User did not initialize spark context!"
standard dataproc image 2.0
Ubuntu 18.04 LTS
Hadoop 3.2
Spark 3.1
I am testing a very simple script on a Dataproc PySpark cluster:
testing_dep.py
import os
os.listdir('./')
I can run testing_dep.py in client mode (the default on Dataproc) just…
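In YARN cluster mode the driver runs inside the cluster, and "User did not initialize spark context!" usually means the script never created a SparkContext/SparkSession for YARN to track. A sketch of the script extended so it works in both modes (requires a Spark environment to run):

```python
# testing_dep.py, extended so cluster mode has a SparkContext to report back.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testing_dep").getOrCreate()
print(os.listdir("./"))
spark.stop()
```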

figs_and_nuts · 4,870

3 votes · 0 answers
Do env variables transfer from driver to workers?
I am using Dataproc to run my PySpark jobs. Following are the three ways I can submit a job:
the dataproc submit command
the spark-submit utility provided by Spark
for small experiments, spark-shell
Now, I have to modify a few…
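Driver-side environment variables are not automatically propagated to executors; Spark's own knobs for this are `spark.executorEnv.<NAME>` for executors and, on YARN, `spark.yarn.appMasterEnv.<NAME>` for the application master. A sketch via dataproc submit (names and values are placeholders):

```shell
# Set the variable explicitly on both executors and the YARN AM:
gcloud dataproc jobs submit pyspark job.py \
    --cluster=my-cluster --region=us-central1 \
    --properties='spark.executorEnv.MY_VAR=foo,spark.yarn.appMasterEnv.MY_VAR=foo'
```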

figs_and_nuts · 4,870

3 votes · 2 answers
Where to find spark log in dataproc when running job on cluster mode
I am running the following code as job in dataproc.
I could not find logs in console while running in 'cluster' mode.
import sys
import time
from datetime import datetime
from pyspark.sql import SparkSession
start_time = datetime.utcnow()
spark =…
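In cluster mode the driver's stdout is not streamed back to the console; it ends up with the YARN container logs, while `driverOutputResourceUri` (populated in client mode) points at GCS. A sketch of both lookups (job and application ids are placeholders):

```shell
# GCS location of the driver output, if any:
gcloud dataproc jobs describe my-job-id --region=us-central1 \
    --format='value(driverOutputResourceUri)'
# In cluster mode, fetch the YARN container logs from the master node:
yarn logs -applicationId application_1234567890123_0001
```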

Nandha · 752

3 votes · 1 answer
OSS supported by Google Cloud Dataproc
When I go to https://cloud.google.com/dataproc, I see this ...
"Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
But gcloud dataproc jobs submit…
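The jobs submit group only covers the engines that have a first-class job type; the rest of the "30+ tools" are components or packages you run on the cluster yourself. Listing the supported types is one way to see the gap:

```shell
gcloud dataproc jobs submit --help
# Job types include: hadoop, hive, pig, presto, pyspark, spark,
# spark-r, spark-sql (the exact list depends on your gcloud version).
```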

Naga Vijayapuram · 845

3 votes · 3 answers
Is it possible to submit a job to a cluster using an initialization script on Google Dataproc?
I am using Dataproc with 1 job on 1 cluster.
I would like to start my job as soon as the cluster is created. I found that the best way to achieve this is to submit a job using an initialization script like below.
function submit_job() {
echo…
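An alternative worth considering: workflow templates exist precisely to couple cluster creation with job submission, avoiding the races involved in submitting from an initialization script. A sketch with placeholder names:

```shell
# Create a template whose managed cluster is built, runs the job, and is torn down:
gcloud dataproc workflow-templates create my-template --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster my-template \
    --region=us-central1 --cluster-name=my-cluster \
    --master-machine-type=n1-standard-4 --num-workers=2
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/job.py \
    --step-id=my-job --workflow-template=my-template --region=us-central1
gcloud dataproc workflow-templates instantiate my-template --region=us-central1
```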

uchiiii · 135

3 votes · 1 answer
Use terraform to automatically create firewall rules along with Dataproc cluster creation
I am using Terraform templates to provision a Google Cloud Dataproc cluster. After that, I'm creating firewall rules to restrict ingress traffic to those compute engine instances.
I'm looking for a way to automatically create firewall rules along…

Kuwali · 233

3 votes · 1 answer
pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
I created a dataproc cluster and was trying to submit my local job for testing.
gcloud beta dataproc clusters create test-cluster \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500…
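That ClassNotFoundException is the usual symptom of a Scala version mismatch between the connector jar and the cluster's Spark build; image 1.5+ ships Spark built against Scala 2.12, so the _2.12 connector is needed. A sketch using Google's public connector bucket (cluster name is a placeholder):

```shell
gcloud dataproc jobs submit pyspark job.py \
    --cluster=test-cluster --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```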

Ahaha · 416