
I had to analyze a dataset using a Google Cloud cluster. I created a bucket on the Google Cloud Platform, created a cluster of machines, and moved the data I wanted to analyze into the bucket (and I checked that it was actually there). I then had to create an SSH tunnel to my cluster, which I did by running the following code:

%%bash
#!/bin/bash
NODE="cluster-west1b-m"
ZONE="europe-west1-b"
PORT=8080
PROJ="myfirstproject09112018"

# -fN backgrounds ssh without running a remote command; -L forwards
# local port $PORT to the same port on the cluster's master node
gcloud compute ssh "$NODE" \
  --project="$PROJ" \
  --zone="$ZONE" -- -fN -L "$PORT:localhost:$PORT"

After doing this I went to localhost:8080, opened a Python notebook there, and imported some Spark libraries:

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

Then I wanted to read my files, so I ran:

natality = spark.read.csv('gs://storage-eu-west-luchino/natality/natality*.csv', header=True, inferSchema=True)

But it tells me it cannot find the file, even though the file is definitely in the bucket, so I can't see where the problem is. The error is essentially this one:

Py4JJavaError: An error occurred while calling o61.csv.
: java.io.IOException: No FileSystem for scheme: gs

Does anybody have any idea why this doesn't work? I really can't figure out the problem.

1 Answer


Spark doesn't understand the gs:// protocol out of the box, hence this error:

No FileSystem for scheme: gs

Instead, you can do either of the following:

  • Use the google-cloud-storage client library to download the files from the bucket and point Spark at the local copies (see the first sketch below).
  • Install a third-party connector so Spark can handle gs:// paths directly (see the second sketch below).
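A minimal sketch of the client-library approach, assuming the google-cloud-storage package is installed and credentials are already configured; the local directory is hypothetical, and reading a local path like this assumes the driver and executors share a filesystem (e.g. a single-node setup):

import os
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('storage-eu-west-luchino')

local_dir = '/tmp/natality'  # hypothetical local target directory
os.makedirs(local_dir, exist_ok=True)

# Download every natality CSV shard from the bucket
for blob in bucket.list_blobs(prefix='natality/'):
    if blob.name.endswith('.csv'):
        blob.download_to_filename(
            os.path.join(local_dir, os.path.basename(blob.name)))

# Spark can now read the local copies without the gs:// scheme
natality = spark.read.csv(os.path.join(local_dir, 'natality*.csv'),
                          header=True, inferSchema=True)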

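Alternatively, a sketch of the connector approach (see the questions linked in the comments). The connector version and configuration keys below are assumptions taken from the gcs-connector's public documentation, not from the original post, and spark.jars.packages only takes effect when the JVM starts, so the notebook kernel would need a restart first:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in the GCS connector jar; the exact version is an assumption
    .config('spark.jars.packages',
            'com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17')
    # Register the gs:// filesystem implementations with Hadoop
    .config('spark.hadoop.fs.gs.impl',
            'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
    .config('spark.hadoop.fs.AbstractFileSystem.gs.impl',
            'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS')
    .getOrCreate()
)

# With the connector on the classpath, the original read works unchanged
natality = spark.read.csv('gs://storage-eu-west-luchino/natality/natality*.csv',
                          header=True, inferSchema=True)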
  • I didn't fully understand. I tried to import the blob library in the Jupyter notebook on my cluster, which I was connected to over the SSH tunnel, but I got an error saying that the google.cloud library was not installed. Are you suggesting I use Spark directly instead of Python, then? And is the encryption key something I have to create, or something I can retrieve from the Google Cloud Platform? – luchino_prince Dec 03 '18 at 17:46
  • Yes, you'll need to install the [`google-cloud-storage`](https://pypi.org/project/google-cloud-storage/) package. See [this link](https://googleapis.github.io/google-cloud-python/latest/core/auth.html) for a guide on how to set up authentication to your storage bucket. – Dustin Ingram Dec 03 '18 at 18:23
  • Thanks, now it works! Could the original code have failed because I am using a MacBook? The original code was from my database professor and it worked perfectly on his PC. Or did he just omit parts of the code, or have packages installed on his PC that I don't? – luchino_prince Dec 03 '18 at 22:24
  • Looks like it is possible to install a third-party connector so Spark can handle GCS files. I've updated my answer; see also [https://stackoverflow.com/q/46659757](https://stackoverflow.com/q/46659757/328036) and [https://stackoverflow.com/q/27782844](https://stackoverflow.com/q/27782844/328036) – Dustin Ingram Dec 03 '18 at 23:46
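For reference on the authentication step discussed in the comments: the client library can pick up a service-account key via the GOOGLE_APPLICATION_CREDENTIALS environment variable. A sketch, where the key path is hypothetical (on a Compute Engine or Dataproc VM the default credentials often work without this step):

import os
from google.cloud import storage

# Hypothetical path to a service-account key downloaded from the Cloud Console
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'

client = storage.Client()  # picks up the key via the environment variable
print(list(client.get_bucket('storage-eu-west-luchino').list_blobs(prefix='natality/')))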