I had to analyze a dataset using a Google Cloud cluster. I created a bucket on the Google Cloud Platform and a cluster of machines, then moved the data I wanted to analyze into the bucket (and I checked that it was actually there). I then had to create an SSH tunnel to my cluster, which I did by running the following code:
%%bash
#!/bin/bash
NODE="cluster-west1b-m"
ZONE="europe-west1-b"
PORT=8080
PROJ="myfirstproject09112018"
gcloud compute ssh "$NODE" \
    --project="$PROJ" \
    --zone="$ZONE" \
    -- -fN -L "$PORT:localhost:$PORT"
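For anyone checking the tunnel command, this is my understanding of what it does (same command as above, just annotated; everything after `--` is passed straight to `ssh`, so those are standard OpenSSH options):

```shell
# Everything after "--" goes to ssh itself:
#   -f  put ssh in the background once the tunnel is up
#   -N  do not run a remote command (we only want the port forward)
#   -L  forward local port 8080 to port 8080 on the master node,
#       so http://localhost:8080 reaches the notebook running on the cluster
gcloud compute ssh "$NODE" \
    --project="$PROJ" \
    --zone="$ZONE" \
    -- -fN -L "$PORT:localhost:$PORT"
```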
After doing this I went to localhost:8080, where I opened a Python notebook and imported the Spark libraries:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
Then I wanted to read my files, so I tried to run:
natality = spark.read.csv('gs://storage-eu-west-luchino/natality/natality*.csv',
                          header=True, inferSchema=True)
But it tells me it cannot find the file, even though the file is in the bucket, so I can't understand where the problem is. The error is basically this one:
Py4JJavaError: An error occurred while calling o61.csv.
: java.io.IOException: No FileSystem for scheme: gs
Does anybody have any idea why this doesn't work? I really can't figure out the problem.
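(To clarify the path, in case the wildcard is the issue: `natality*.csv` is meant to match several CSV objects under the `natality/` prefix, not a single file. Locally the same pattern behaves like this, using Python's standard `fnmatch` module and made-up file names:)

```python
from fnmatch import fnmatch

# Made-up object names, just to show what the wildcard is meant to select.
names = ["natality2016.csv", "natality2017.csv", "readme.txt"]
matched = [n for n in names if fnmatch(n, "natality*.csv")]
print(matched)  # the two natality CSVs match, readme.txt does not
```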