I have a series of questions (sorry, Google's documentation is awful and not user-friendly):
- Is Dataproc the equivalent of Amazon EMR on Google Cloud? I'm using this documentation to run a Spark job: https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial
- Can you SSH into the master node and run a Spark job on the entire cluster, or do you have to use Google's `gcloud dataproc jobs submit ...` command?
- When I run a Spark job locally and access Google Cloud Storage, it works without a problem. When I run the same job on Dataproc, it crashes. (A rough sketch of what `spark.py` does is below.)
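For context, this is a minimal sketch of the kind of thing `spark.py` does; the bucket, file, and options here are placeholders rather than my real ones:

```python
from pyspark.sql import SparkSession

# Build a SparkSession; on Dataproc this picks up the cluster configuration automatically.
spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# Read a file straight from Cloud Storage via the gs:// scheme.
# "my-bucket" and "data.csv" are placeholders, not my real bucket/object.
df = spark.read.csv("gs://my-bucket/data.csv", header=True)
df.show(5)

spark.stop()
```

Locally this runs fine; on Dataproc it fails with the error shown at the end of this question.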
I have read:
- https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector
- reading google bucket data in spark
- "No Filesystem for Scheme: gs" when running spark job locally
So far I have tried the following:
- I have placed `gcs-connector-hadoop2-latest.jar` and `my_project.json` on my master and worker nodes in `/etc/hadoop/conf`.
I have added the following, on my master and worker nodes, to `/etc/hadoop/conf/core-site.xml` (see the sketch after the error below for how I understand these settings):

```xml
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>my_project.json</name>
  <value>full path to JSON keyfile downloaded for service account</value>
</property>
```
I tried running the following commands:
```sh
sudo gcloud dataproc jobs submit pyspark spark.py --cluster=${CLUSTER}
```

and

```sh
sudo gcloud dataproc jobs submit pyspark \
    --jars /etc/hadoop/conf/gcs-connector-hadoop2-latest.jar \
    spark.py --cluster=${CLUSTER}
```
- I keep getting the following error:

```
No FileSystem for scheme: gs
```
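In case it helps with diagnosis: my understanding (which may well be wrong) is that the `core-site.xml` entries above correspond to Hadoop configuration that can also be set from inside the job, roughly like this. The property names below are my guess from the storage-connector documentation, and the keyfile path is simply where I placed `my_project.json`; I have not confirmed any of this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# Hadoop configuration the GCS connector is supposed to read. The property
# names are my guess from the connector docs; the keyfile path is where I
# copied my_project.json on the nodes.
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile",
         "/etc/hadoop/conf/my_project.json")

df = spark.read.csv("gs://my-bucket/data.csv", header=True)
df.show(5)
```

I don't know whether something like this is needed in addition to (or instead of) the `core-site.xml` changes, or whether the connector JAR is simply not on the classpath.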
I do not know what to do next.