I have been able to run a Python 3 / Spark 2.2.1 program successfully on Google's Colab research platform:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
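As a quick sanity check (my own addition, not part of the recipe above), the session can be verified with:
print(spark.version)  # should print 2.2.1, matching the download above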
This setup works perfectly when I upload text files from my local computer to the Unix VM using
from google.colab import files
datafile = files.upload()
and read them as follows:
textRDD = spark.read.text('hobbit.txt').rdd
so far so good ..
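For instance, a trivial check like the one below runs fine on that RDD (just my own illustration that the upload path works):
# count the lines of the uploaded file and peek at the first few records
print(textRDD.count())
print(textRDD.take(3))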
My problem starts when I try to read a file that is sitting in my Google Drive Colab directory.
Following the instructions, I have authenticated the user and created a Drive service:
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
after which I have been able to access the file in the Drive as follows:
file_id = '1RELUMtExjMTSfoWF765Hr8JwNCSL7AgH'
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
print('Downloaded file contents are: {}'.format(downloaded.read()))
Downloaded file contents are: b'The king beneath the mountain\r\nThe king of ......
even this works perfectly ..
downloaded.seek(0)
print(downloaded.read().decode('utf-8'))
and gets the data:
The king beneath the mountain
The king of carven stone
The lord of silver fountain ...
Where things FINALLY GO WRONG is when I try to grab this data and put it into a Spark RDD:
downloaded.seek(0)
tRDD = spark.read.text(downloaded.read().decode('utf-8'))
and I get the error ..
AnalysisException: 'Path does not exist: file:/content/The king beneath the mountain\ ....
Evidently, I am not using the correct method / parameters to read the file into Spark. I have tried quite a few of the methods described elsewhere, but none have worked so far.
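My current guess is that spark.read.text() expects a file path rather than the file's contents, which would explain why Spark tries to interpret the poem itself as a path. One workaround I have sketched, but am not sure is the right way, is to write the downloaded bytes to the Colab VM's local filesystem and point Spark at that local path (the filename hobbit_from_drive.txt is just my own choice):
# write the bytes fetched from Drive to a local file on the Colab VM,
# then read that local path with Spark as before
downloaded.seek(0)
with open('hobbit_from_drive.txt', 'wb') as f:
    f.write(downloaded.read())
tRDD = spark.read.text('hobbit_from_drive.txt').rdd
print(tRDD.take(3))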
I would be very grateful if someone can help me figure out how to read this file for subsequent processing.