
I am currently working on a data set that is 10 GB in size. I have uploaded it to Google Cloud Storage, but I don't know how to import it into Google Colab.

Shubham Tiwari

2 Answers

from google.colab import auth

# Authenticate this Colab session with your Google account.
auth.authenticate_user()

Once you run this, a link will be generated; click it and complete the sign-in.

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

Use this to install gcsfuse on Colab. Cloud Storage FUSE is an open-source FUSE adapter that lets you mount Cloud Storage buckets as file systems on Colab, Linux, or macOS systems.

!mkdir folderOnColab
!gcsfuse folderOnBucket folderOnColab

Use this to mount the bucket. (folderOnBucket is the name of the GCS bucket, i.e. its URL without the gs:// prefix. gcsfuse takes a plain bucket name as its first argument; to mount only a subdirectory of the bucket, use the --only-dir flag.)

You can use these docs for further reading: https://cloud.google.com/storage/docs/gcs-fuse
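
Once mounted, files in the bucket behave like local files. A minimal sketch of reading from the mount with pandas, assuming a hypothetical data.csv at the bucket root:

import pandas as pd

# The mount point behaves like a normal directory; "data.csv" is a
# hypothetical file name - replace it with an object in your bucket.
df = pd.read_csv("folderOnColab/data.csv")
print(df.shape)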

Tharaka Devinda

    The solution works great! In case GCP data/folders are still not visible in the Google Colab folder, add the --implicit-dirs flag. More info here -> https://stackoverflow.com/a/38319745/4533505 – Chandan Kumar Apr 18 '20 at 18:07
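
For example, the mount command from the answer above with the flag added (same names as before):

!gcsfuse --implicit-dirs folderOnBucket folderOnColab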

Using a dedicated service account and Python:

from google.oauth2 import service_account
from google.cloud import storage
import pandas as pd
import json
import filecmp

Using the service account key (JSON) as a string:

SERVICE_ACCOUNT = json.loads(r"""{
  "type": "service_account",
  "project_id": "[REPLACE WITH YOUR FILE]",
  "privat_sae_key_id": "[REPLACE WITH YOUR FILE]",
  "private_key": "[REPLACE WITH YOUR FILE]",
  "client_email": "[REPLACE WITH YOUR FILE]",
  "client_id": "[REPLACE WITH YOUR FILE]",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "[REPLACE WITH YOUR FILE]"
}""")

BUCKET = "[NAME OF YOUR BUCKET TO READ/WRITE YOUR DATA]"

Using the service account credentials to create the client:

credentials = service_account.Credentials.from_service_account_info(
    SERVICE_ACCOUNT,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = storage.Client(
    credentials=credentials,
    project=credentials.project_id,
)

Save and download functions:

def save_file(local_filename, remote_filename):
    # Upload a local file to the bucket under the given object name.
    bucket = client.get_bucket(BUCKET)
    blob = bucket.blob(remote_filename)
    blob.upload_from_filename(local_filename)

def download_file(local_filename, remote_filename):
    # Download an object from the bucket to a local file.
    bucket = client.get_bucket(BUCKET)
    blob = bucket.blob(remote_filename)
    blob.download_to_filename(local_filename)

Let's check with a CSV file generated by Pandas:

pd.DataFrame(
    {"col1": [1, 2, 3],
     "col2": [4, 5, 6]}
).to_csv(path_or_buf="/tmp/test.csv")

save_file("/tmp/test.csv", "test.csv")
download_file("/tmp/test2.csv", "test.csv")
assert filecmp.cmp("/tmp/test.csv", "/tmp/test2.csv")
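
If you would rather not round-trip through the local disk, you can also parse a blob straight into pandas. A minimal sketch, assuming the object fits in memory and a reasonably recent google-cloud-storage (for download_as_bytes):

from io import BytesIO

# Download the object's bytes into memory and parse them directly,
# skipping the temporary file. "test.csv" is the object uploaded above.
bucket = client.get_bucket(BUCKET)
blob = bucket.blob("test.csv")
df = pd.read_csv(BytesIO(blob.download_as_bytes()))
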
Kartoch