
I have been able to run a Python 3 / Spark 2.2.1 program on Google's Colab research platform:

!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

This works perfectly when I upload text files from my local computer to the Colab VM using

from google.colab import files
datafile = files.upload()

and read them as follows:

textRDD = spark.read.text('hobbit.txt').rdd

So far so good.

My problem starts when I try to read a file that is sitting in my Google Drive Colab directory.

Following the instructions, I have authenticated the user and created a Drive service:

from google.colab import auth
auth.authenticate_user()

from googleapiclient.discovery import build
drive_service = build('drive', 'v3')

after which I have been able to access the file in Drive as follows:

file_id = '1RELUMtExjMTSfoWF765Hr8JwNCSL7AgH'

import io
from googleapiclient.http import MediaIoBaseDownload

request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while not done:
  # The first return value is a progress object, which we ignore
  # because our file is small.
  _, done = downloader.next_chunk()

downloaded.seek(0)
print('Downloaded file contents are: {}'.format(downloaded.read()))

Downloaded file contents are: b'The king beneath the mountain\r\nThe king of ......

Even this works perfectly:

downloaded.seek(0)
print(downloaded.read().decode('utf-8'))

and gets the data

The king beneath the mountain
The king of carven stone
The lord of silver fountain ...

Where things FINALLY GO WRONG is when I try to grab this data and put it into a Spark RDD:

downloaded.seek(0)
tRDD = spark.read.text(downloaded.read().decode('utf-8'))

and I get the error:

AnalysisException: 'Path does not exist: file:/content/The king beneath the mountain\ ....

Evidently, I am not using the correct method / parameters to read the file into Spark. I have tried quite a few of the methods described, without success.

I would be very grateful if someone can help me figure out how to read this file for subsequent processing.

Calcutta
  • You can start by asking what does [spark.read.text()](https://www.tutorialkart.com/apache-spark/read-input-text-file-to-rdd-example/) command accept as arguments. It says in the documents that it reads "HDFS/local file system/any hadoop supported file system URI" none of which seem to be related to drive api. – ReyAnthonyRenacia Apr 17 '18 at 09:10
  • so i was hoping that there would be some other method with which i can read the data. would be grateful if someone can suggest an alternative method – Calcutta Apr 19 '18 at 10:05

2 Answers


A complete solution to this problem is available in another StackOverflow question at this URL.

Here is the notebook where this solution is demonstrated.

I have tested it and it works!

Calcutta

It seems that spark.read.text expects a file name, but you are giving it the file contents instead. You can try either of these:

  • save it to a file then give the name
  • use just downloaded instead of downloaded.read().decode('utf-8')
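
The first option can be sketched as follows. This is a minimal sketch: the `downloaded` buffer from the question is simulated here with a small in-memory sample, and the final Spark call is left commented because it assumes the `spark` session from the question:

```python
import io

# Simulated stand-in for the io.BytesIO buffer returned by the Drive download.
downloaded = io.BytesIO(b'The king beneath the mountain\r\nThe king of carven stone\r\n')

# Write the buffer to the VM's local disk so that Spark has a real path to read.
downloaded.seek(0)
with open('hobbit.txt', 'wb') as f:
    f.write(downloaded.read())

# Now spark.read.text receives a path rather than the file contents:
# tRDD = spark.read.text('hobbit.txt').rdd
```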

You can also simplify downloading from Google Drive with pydrive. I gave an example here.

https://gist.github.com/korakot/d56c925ff3eccb86ea5a16726a70b224

Downloading is then just:

fid = drive.ListFile({'q':"title='hobbit.txt'"}).GetList()[0]['id']
f = drive.CreateFile({'id': fid})
f.GetContentFile('hobbit.txt')
korakot
  • I was trying to avoid writing the file into the VM disk and then reading it again into spark ... but if nothing else is possible, will use that – Calcutta Apr 18 '18 at 07:41
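
If writing to the VM disk is to be avoided entirely, one alternative sketch is to split the decoded text into lines and hand them to Spark as an in-memory collection via `sc.parallelize`. Again the `downloaded` buffer is simulated with sample text, and the Spark call is commented because it assumes the `spark` session from the question:

```python
import io

# Simulated stand-in for the io.BytesIO buffer from the Drive download.
downloaded = io.BytesIO(b'The king beneath the mountain\r\nThe king of carven stone\r\n')

# Decode the buffer and split it into lines entirely in memory.
downloaded.seek(0)
lines = downloaded.read().decode('utf-8').splitlines()

# Hand the lines straight to Spark without touching the VM disk
# (assumes the SparkSession `spark` built in the question):
# tRDD = spark.sparkContext.parallelize(lines)
```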