Questions tagged [dataproc]

130 questions
6
votes
1 answer

Installing python packages in Serverless Dataproc GCP

I want to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to run an initialization action to install Python packages on Serverless Dataproc?
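Dataproc Serverless has no initialization actions; dependencies are attached at batch-submission time instead, either through a custom container image or by shipping the package itself. A minimal sketch with the Python client, assuming hypothetical project, bucket, and wheel paths:

```python
from google.cloud import dataproc_v1

# Regional endpoint; project/bucket/wheel names below are hypothetical.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/job.py",
        # a pure-Python wheel (or .zip/.egg) shipped here lands on the
        # executors' PYTHONPATH, standing in for an init action
        python_file_uris=[
            "gs://my-bucket/deps/python_json_logger-2.0.7-py3-none-any.whl"
        ],
    )
)
operation = client.create_batch(
    parent="projects/my-project/locations/us-central1", batch=batch
)
operation.result()  # waits for the batch to finish
```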
4
votes
1 answer

Dataproc: Can a user create workers of different instance types?

Scenario: master: x1 machine type; workers: x2 machine type, x3 machine type. For the above scenario, AWS EMR instance fleets let users create workers of different instance types. In the Dataproc console, I noticed the option is only for N-worker…
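Within a single Dataproc worker group every VM shares one machine type, but the primary and secondary worker groups are separate InstanceGroupConfig messages, so their machine types can differ. A sketch with the Python client (all names hypothetical):

```python
from google.cloud import dataproc_v1

cluster = dataproc_v1.Cluster(
    cluster_name="mixed-workers",
    config=dataproc_v1.ClusterConfig(
        master_config=dataproc_v1.InstanceGroupConfig(
            num_instances=1, machine_type_uri="n1-standard-4"   # x1
        ),
        worker_config=dataproc_v1.InstanceGroupConfig(
            num_instances=2, machine_type_uri="n1-standard-8"   # x2
        ),
        # the secondary (preemptible/spot) group is a separate config,
        # so it can use a different machine type than the primary group
        secondary_worker_config=dataproc_v1.InstanceGroupConfig(
            num_instances=2, machine_type_uri="n1-highmem-8"    # x3
        ),
    ),
)
```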
4
votes
0 answers

Dataproc Job not giving any output

I have submitted a Spark job through Airflow; sometimes the job works and sometimes it gives no output at all. Even after 2-3 hours of waiting, the job gives no detail apart from "Waiting for job output...". I am using dataproc-1-4-deb10. It's a simple job…
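When the client sits at "Waiting for job output...", the job state itself can still be inspected directly. A sketch with the Python client, assuming hypothetical project, region, and job IDs:

```python
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = client.get_job(project_id="my-project", region="us-central1", job_id="my-job-id")
print(job.status.state.name)           # PENDING, SETUP_DONE, RUNNING, DONE, ERROR, ...
print(job.driver_output_resource_uri)  # GCS prefix where the driver's stdout is written
```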
4
votes
1 answer

Apache Phoenix - GCP Dataproc

I am doing a POC on Google Cloud Dataproc with HBase as one of the components. I created a cluster and got it running along with the HBase service. I can list and create tables via the shell. I want to use Apache Phoenix as…
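One hedged way to use Phoenix from Python is through the Phoenix Query Server's Avatica endpoint with the phoenixdb package, assuming the query server is running on the master node at its default port 8765 (the host name below is hypothetical):

```python
import phoenixdb  # pip install phoenixdb

# Assumes the Phoenix Query Server is up on the cluster's master node.
conn = phoenixdb.connect("http://test-cluster-m:8765/", autocommit=True)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (1, 'alice')")
cur.execute("SELECT id, name FROM users")
print(cur.fetchall())
```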
4
votes
1 answer

Is it possible to set a fully customized metric for auto scale-out of Dataproc worker nodes in GCP (Google Cloud Platform)?

Is it possible to set a fully customized metric for auto scale-out of Dataproc worker nodes in GCP (Google Cloud Platform)? I want to run distributed Spark processing on Dataproc in GCP, but I just want to horizontally scale…
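Dataproc's built-in autoscaler reacts to YARN memory metrics only, so a fully custom metric means resizing the cluster yourself. A sketch of the resize half, with hypothetical names; the metric-polling half is whatever monitoring you already have:

```python
from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

def resize_workers(num_workers: int) -> None:
    """Patch only the primary worker count of an existing cluster."""
    operation = client.update_cluster(
        project_id="my-project",
        region="us-central1",
        cluster_name="my-cluster",
        cluster=dataproc_v1.Cluster(
            config=dataproc_v1.ClusterConfig(
                worker_config=dataproc_v1.InstanceGroupConfig(num_instances=num_workers)
            )
        ),
        update_mask={"paths": ["config.worker_config.num_instances"]},
    )
    operation.result()

# after your own metric check, e.g.:
# resize_workers(10)
```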
3
votes
0 answers

How can I cancel a dataproc job before it starts running?

I have to wait for a job to start running before I can cancel it. Is there a way to cancel the job early? Why can I not cancel a job in SETUP_DONE? Cancelling the job errors with FAILED_PRECONDITION: Cannot cancel jobId 'x' in project 'y' in state:…
stefanQ • 31 • 2
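A common hedged workaround is to retry the cancel until the job leaves its pre-running state, since cancel_job raises FAILED_PRECONDITION while the job is still in SETUP_DONE (project/region/job IDs hypothetical):

```python
import time

from google.api_core.exceptions import FailedPrecondition
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

# keep retrying until the job reaches a cancellable state
while True:
    try:
        client.cancel_job(project_id="y", region="us-central1", job_id="x")
        break
    except FailedPrecondition:
        time.sleep(5)
```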
3
votes
1 answer

How to enable the Spark web interface on Dataproc (GCP) using DataprocCreateClusterOperator in Apache Airflow

We are using Apache Airflow's DataprocCreateClusterOperator to create a Spark cluster on GCP (Dataproc) and want to enable the Spark web UI interfaces. When creating from the terminal, we pass --enable-component-gateway in the create cluster command. How can…
Anoop Deshpande • 514 • 1 • 6 • 23
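The --enable-component-gateway flag maps to the endpoint_config block of the REST API's ClusterConfig, which the operator accepts inside cluster_config. A sketch (project, region, and machine types hypothetical):

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="spark-cluster",
    cluster_config={
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # REST equivalent of gcloud's --enable-component-gateway
        "endpoint_config": {"enable_http_port_access": True},
    },
)
```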
3
votes
1 answer

DataprocClusterCreateOperator doesn't have a temp_bucket variable to define

I am trying to create a Dataproc cluster via DataprocClusterCreateOperator in Apache Airflow. Airflow version: 1.10.15; Composer version: 1.16.4. I want to assign the project's temp bucket to the cluster, not the bucket Google creates during…
naval m • 61 • 4
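The v1 API's ClusterConfig does expose a top-level temp_bucket field even though the Airflow 1.10.x operator has no matching argument; with a provider version that takes a raw cluster_config dict, it can be set directly (bucket and machine names hypothetical):

```python
cluster_config = {
    "config_bucket": "my-staging-bucket",  # staging bucket (has an operator arg)
    "temp_bucket": "my-temp-bucket",       # no operator arg in 1.10.x, but a
                                           # plain ClusterConfig field in the API
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
}
```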
3
votes
1 answer

PySpark runs in YARN client mode but fails in cluster mode with "User did not initialize spark context!"

Standard Dataproc image 2.0: Ubuntu 18.04 LTS, Hadoop 3.2, Spark 3.1. I am testing a very simple script on a Dataproc PySpark cluster: testing_dep.py: import os os.listdir('./') I can run testing_dep.py in client mode (the default on Dataproc) just…
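In yarn-cluster mode the YARN ApplicationMaster waits for the user code to create a SparkContext and kills the application if none appears, which is exactly what a plain os.listdir script never does. A minimal fix for the script above:

```python
import os

from pyspark.sql import SparkSession

# cluster mode requires the user application to initialize a Spark context;
# without this line the AM fails with "User did not initialize spark context!"
spark = SparkSession.builder.appName("testing_dep").getOrCreate()

print(os.listdir("./"))

spark.stop()
```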
3
votes
0 answers

Do env variables transfer from driver to workers?

I am using Dataproc to run my PySpark jobs. Following are the three ways I can submit my jobs: the dataproc submit command; the spark-submit utility provided by Spark; and, for small experiments, spark-shell. Now, I have to modify a few…
figs_and_nuts • 4,870 • 2 • 31 • 56
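Driver-side environment variables do not propagate to executors on their own; Spark's spark.executorEnv.&lt;NAME&gt; properties set them on each executor instead. A small check, with the variable name made up for illustration:

```python
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("env-check")
    # sets MY_FLAG in every executor's environment
    .config("spark.executorEnv.MY_FLAG", "1")
    .getOrCreate()
)

# read the variable back from inside an executor task
print(spark.sparkContext.parallelize([0]).map(
    lambda _: os.environ.get("MY_FLAG")
).collect())  # ['1']

spark.stop()
```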
3
votes
2 answers

Where to find the Spark log in Dataproc when running a job in cluster mode

I am running the following code as a job in Dataproc. I could not find the logs in the console while running in 'cluster' mode. import sys import time from datetime import datetime from pyspark.sql import SparkSession start_time = datetime.utcnow() spark =…
Nandha • 752 • 1 • 12 • 37
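In cluster mode the driver runs inside a YARN container, so its stdout goes to the YARN container logs (and, when the integration is enabled, to Cloud Logging) rather than to the job's driver output in the console. A hedged Cloud Logging sketch; the filter and project name are assumptions:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")
# assumed filter: Dataproc ships YARN container logs under a
# yarn-userlogs log id when Cloud Logging integration is on
log_filter = 'resource.type="cloud_dataproc_cluster" logName:"yarn-userlogs"'
for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.payload)
```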
3
votes
1 answer

OSS supported by Google Cloud Dataproc

When I go to https://cloud.google.com/dataproc, I see this ... "Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks." But gcloud dataproc jobs submit…
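The Job resource behind gcloud dataproc jobs submit exposes one field per first-class engine; everything else in the "30+ tools" list runs on the cluster but is not a job type. A submit sketch with the Python client (names hypothetical):

```python
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = dataproc_v1.Job(
    placement=dataproc_v1.JobPlacement(cluster_name="my-cluster"),
    # the Job oneof covers hadoop_job, spark_job, pyspark_job, hive_job,
    # pig_job, spark_r_job, spark_sql_job, presto_job, ...
    pyspark_job=dataproc_v1.PySparkJob(main_python_file_uri="gs://my-bucket/job.py"),
)
client.submit_job(project_id="my-project", region="us-central1", job=job)
```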
3
votes
3 answers

Is it possible to submit a job to a cluster using an initialization script on Google Dataproc?

I am using Dataproc with one job on one cluster. I would like to start my job as soon as the cluster is created. I found that the best way to achieve this is to submit a job using an initialization script like the one below. function submit_job() { echo…
uchiiii • 135 • 2 • 7
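For the one-cluster-one-job pattern, an inline workflow template is an alternative to init-script submission: it creates the managed cluster, runs the job, and deletes the cluster in a single operation. A sketch, names hypothetical:

```python
from google.cloud import dataproc_v1

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
template = dataproc_v1.WorkflowTemplate(
    placement=dataproc_v1.WorkflowTemplatePlacement(
        managed_cluster=dataproc_v1.ManagedCluster(
            cluster_name="one-shot-cluster",
            config=dataproc_v1.ClusterConfig(),  # cluster shape elided
        )
    ),
    jobs=[
        dataproc_v1.OrderedJob(
            step_id="main",
            pyspark_job=dataproc_v1.PySparkJob(
                main_python_file_uri="gs://my-bucket/job.py"
            ),
        )
    ],
)
# create cluster -> run job -> delete cluster, as one operation
operation = client.instantiate_inline_workflow_template(
    parent="projects/my-project/regions/us-central1", template=template
)
operation.result()
```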
3
votes
1 answer

Use Terraform to automatically create firewall rules along with Dataproc cluster creation

I am using Terraform templates to provision a Google Cloud Dataproc cluster. After that, I'm creating firewall rules to restrict ingress traffic to those Compute Engine instances. I'm looking for a way to automatically create the firewall rules along…
3
votes
1 answer

pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class

I created a Dataproc cluster and was trying to submit my local job for testing. gcloud beta dataproc clusters create test-cluster \ --region us-central1 \ --zone us-central1-c \ --master-machine-type n1-standard-4 \ --master-boot-disk-size 500…
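org.apache.spark.internal.Logging$class is the Scala 2.11 trait encoding, so this error usually means a Scala 2.11 build of the BigQuery connector running on a Scala 2.12 Spark (Dataproc image 2.0). A hedged fix is pinning the _2.12 connector artifact (the version shown is only an example):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bq-read")
    # Scala 2.12 build for Spark 3; the _2.11 artifact triggers
    # ClassNotFoundException: org.apache.spark.internal.Logging$class
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
    )
    .getOrCreate()
)
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.show(5)
```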