
I want to use Kaggle data sets stored in a Google Cloud Storage bucket when working in Colab.

First: Is there a way to upload Kaggle data sets directly to a Google bucket via the Kaggle API?

Second: How do I use data in a Google bucket from Colab without copying it to the notebook?

At the moment, my only experience using a Google bucket with Colab is through a URI for audio transcription, such as this:

gcs_uri = 'gs://bucket_name/file_name.wav'
audio = types.RecognitionAudio(uri=gcs_uri)

I'm guessing I can do something similar to load data into a pandas DataFrame directly from a URI. My experience with the Kaggle API is on my local machine, for example:

kaggle competitions download -c petfinder-adoption-prediction

This downloads the data using the Kaggle API. Data loaded into a Colab notebook is removed between sessions, so my intention in using a Google bucket is to keep it available across multiple sessions.

desertnaut
Vlad

1 Answer


For your first question, you could try this solution. I'm not sure whether wget works with the data set you need, but this suggests it is possible. Note that it does not go through the Kaggle API.

For the second question, how to use data without copying it to the notebook: you can mount the bucket as a disk on your instance and then access the data directly.

Putting the two together, you could mount the bucket locally and move the data into it. You can then access it from the notebook.
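If mounting proves difficult, one alternative is to download with the Kaggle CLI and then copy the files into the bucket with `gsutil`. The following is only a rough sketch, not a tested recipe: it assumes the `kaggle` and `gsutil` command-line tools are installed and authenticated in the session (a `kaggle.json` token for Kaggle, and something like `google.colab.auth.authenticate_user()` for Google Cloud). The helper names, the `bucket_name` placeholder, and the `/content/kaggle_data` directory are hypothetical.

```python
import subprocess


def kaggle_to_bucket_commands(competition, bucket_name, workdir="/content/kaggle_data"):
    """Build the two CLI invocations: download from Kaggle, then copy to the bucket.

    Hypothetical helper; it only assembles argument lists and runs nothing.
    """
    # Download the competition files into workdir using the Kaggle CLI.
    download = ["kaggle", "competitions", "download", "-c", competition, "-p", workdir]
    # Recursively copy workdir into the bucket (-m enables parallel transfers).
    upload = ["gsutil", "-m", "cp", "-r", workdir, f"gs://{bucket_name}/"]
    return [download, upload]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Example call (assumes both tools are already authenticated in this session):
# run_commands(kaggle_to_bucket_commands("petfinder-adoption-prediction", "bucket_name"))
```

The data still passes through the notebook's temporary disk on its way to the bucket, but it only has to be downloaded once. Later sessions can read it from the bucket.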

techcyclist
  • I've tried installing gcsfuse on Colab following https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md, but it failed. Perhaps there is a simpler way just using the URI? – Vlad Feb 22 '19 at 00:43
  • I'm going to try on my Mac and see if it works, which it should. I think it's simplest to install the gcloud SDK first, if you haven't already. – techcyclist Feb 22 '19 at 01:10
  • It seems much easier to mount Google Drive in Colab than to mount Google buckets: https://colab.research.google.com/notebooks/io.ipynb – Vlad Feb 22 '19 at 01:18
  • If so, then you might want to do just that. I tried installing on the Mac and found that I would need to disable "rootless" mode to make it work! So I switched to a Linux Ubuntu machine and ran into a token issue, which I think is because I need the gcloud SDK first. I'll do that and see how it goes. Meanwhile, you are right, I first needed to install gcsfuse from the GitHub repo. – techcyclist Feb 22 '19 at 02:07
  • The reason I suggested this is that at a former company I worked for, we used notebooks with Google buckets mounted for our data, and it worked very well. I do remember we ran into various issues getting it to work, so I'm not suggesting it's simple. With Google, it seems, nothing ever is. – techcyclist Feb 22 '19 at 02:10
  • Mounting GCS is rarely a good idea. I'd read directly from GCS with e.g. `gcsfs`. – Frenzy Kiwi Feb 22 '19 at 06:51
  • Yes, if transfer speed isn't a concern. – techcyclist Feb 22 '19 at 15:15
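
The direct-read approach suggested in the comments might look roughly like the sketch below. It assumes `pandas` and `gcsfs` are installed and the session is already authenticated for the bucket. With `gcsfs` present, pandas can open `gs://` URIs itself. The helper names and the `train/train.csv` object path are hypothetical.

```python
def gcs_uri(bucket_name, blob_path):
    """Build a gs:// URI from a bucket name and an object path (hypothetical helper)."""
    return f"gs://{bucket_name}/{blob_path.lstrip('/')}"


def read_csv_from_gcs(bucket_name, blob_path, **kwargs):
    """Read a CSV object from the bucket into a DataFrame without mounting anything.

    Assumes pandas and gcsfs are installed; pandas hands gs:// URIs to gcsfs.
    """
    import pandas as pd  # imported lazily so gcs_uri works without pandas installed

    return pd.read_csv(gcs_uri(bucket_name, blob_path), **kwargs)


# Example call (assumes the object exists and credentials are configured):
# df = read_csv_from_gcs("bucket_name", "train/train.csv")
```

The file is still fetched over the network on every read, so the transfer-speed caveat above still applies.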