I have a series of questions (sorry, Google's documentation is awful and not user-friendly):
- Is Dataproc the equivalent of Amazon EMR on Google Cloud? I'm using this documentation to run a Spark job: https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial
- Can you SSH into the master node and run a Spark job on the entire cluster, or do you have to use Google's `gcloud dataproc jobs submit ...` command?
- When I run a Spark job locally and access Google Cloud Storage, it works without a problem. When I run the same job on Dataproc, it crashes. (A rough sketch of what `spark.py` does is below.)
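For context, this is a minimal sketch of the kind of thing `spark.py` does; the bucket, file, and options here are placeholders rather than my real ones:

```python
from pyspark.sql import SparkSession

# Build a SparkSession; on Dataproc this picks up the cluster configuration automatically.
spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# Read a file straight from Cloud Storage via the gs:// scheme.
# "my-bucket" and "data.csv" are placeholders, not my real bucket/object.
df = spark.read.csv("gs://my-bucket/data.csv", header=True)
df.show(5)

spark.stop()
```

Locally this runs fine; on Dataproc it fails with the error shown at the end of this question.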
I have read:
- https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector
- reading google bucket data in spark
- "No Filesystem for Scheme: gs" when running spark job locally
So far I have tried the following:
- I have placed `gcs-connector-hadoop2-latest.jar` and `my_project.json` on my master and worker nodes in `/etc/hadoop/conf`.
I have added the following, on my master and worker nodes, to `/etc/hadoop/conf/core-site.xml` (see the sketch after the error below for how I understand these settings):

```xml
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>my_project.json</name>
  <value>full path to JSON keyfile downloaded for service account</value>
</property>
```
I tried running the following commands:
```sh
sudo gcloud dataproc jobs submit pyspark spark.py --cluster=${CLUSTER}
```

and

```sh
sudo gcloud dataproc jobs submit pyspark \
    --jars /etc/hadoop/conf/gcs-connector-hadoop2-latest.jar \
    spark.py --cluster=${CLUSTER}
```
- I keep getting the following error:

```
No FileSystem for scheme: gs
```
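In case it helps with diagnosis: my understanding (which may well be wrong) is that the `core-site.xml` entries above correspond to Hadoop configuration that can also be set from inside the job, roughly like this. The property names below are my guess from the storage-connector documentation, and the keyfile path is simply where I placed `my_project.json`; I have not confirmed any of this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# Hadoop configuration the GCS connector is supposed to read. The property
# names are my guess from the connector docs; the keyfile path is where I
# copied my_project.json on the nodes.
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile",
         "/etc/hadoop/conf/my_project.json")

df = spark.read.csv("gs://my-bucket/data.csv", header=True)
df.show(5)
```

I don't know whether something like this is needed in addition to (or instead of) the `core-site.xml` changes, or whether the connector JAR is simply not on the classpath.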
I do not know what to do next.