
I am trying to use Google Datalab to read a file into an IPython notebook with the basic pd.read_csv(), but I can't find the path of the file. I have it locally and have also uploaded it to a bucket in Google Cloud Storage.

I ran the following commands to understand where I am:

os.getcwd()

gives '/content/myemail@gmail.com'

os.listdir('/content/myemail@gmail.com')

gives ['.git', '.gitignore', 'datalab', 'Hello World.ipynb', '.ipynb_checkpoints']

vvv
3 Answers


The following reads the contents of the object into a string variable called text:

%%storage read --object "gs://path/to/data.csv" --variable text

Then

import pandas as pd
from cStringIO import StringIO  # Python 2; on Python 3 use io.StringIO

# Wrap the string in a file-like object so read_csv can parse it
mydata = pd.read_csv(StringIO(text))
mydata.head()
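On Python 3, where the cStringIO module no longer exists, the same pattern works with io.StringIO. A minimal, self-contained sketch, using inline CSV text as a stand-in for the `text` variable that `%%storage read` would populate:

```python
import io
import pandas as pd

# Stand-in for the `text` variable filled in by %%storage read
text = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"

# io.StringIO wraps the string in a file-like object accepted by read_csv
mydata = pd.read_csv(io.StringIO(text))
print(mydata.shape)  # (2, 3)
```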

Hopefully Pandas will eventually support "gs://" URLs (as it currently does for s3://) to allow reading directly from Google Cloud Storage.

I have found the following docs really helpful:

https://github.com/GoogleCloudPlatform/datalab/tree/master/content/datalab/tutorials

Hope that helps (just getting started with Datalab too, so maybe someone will have a cleaner method soon).

Chris
  • I get this ERROR: Cell magic `%%storage` not found (But line magic `%storage` exists, did you mean that instead?)? – vvv Jan 12 '16 at 03:59
  • also looks like I have to specify the path, but that is what is unknown to me :) – vvv Jan 12 '16 at 04:00
  • `%%storage` does work for me. The two bits of code are in separate cells in the notebook; `%%` is a cell magic. Just to clarify the path: `gs://path/the/data.csv` points to the file on Google Cloud Storage in your bucket, not locally on your laptop, so the one you uploaded. `gs://bucket/file.csv` – Chris Jan 12 '16 at 14:01
  • My Datalab version (from clicking the "i" top right) gives: Version: 0.5.20151127 Based on Jupyter (formerly IPython) 4, incase there are any version mis-matches going on. – Chris Jan 12 '16 at 14:03
  • ok so it started working for whatever reason. %%storage read --object "https://console.cloud.google.com/m/cloudstorage/b/project1/o/dataset.csv" gives an error of storage read: error: argument -v/--variable is required; tried to put a variable name before and after but doesn't seem to work – vvv Jan 12 '16 at 20:54
  • Did you include the --variable argument? That specifies the name of the Python variable that will be assigned the content of the GCS object. – Graham Wheeler Mar 04 '16 at 00:57
  • @Chris the github link is down.. also is there a way to specify the encoding? – supersan Jun 18 '17 at 15:35
  • It looks like datalab has moved to its own home https://github.com/googledatalab/notebooks – Chris Jun 18 '17 at 17:25

You can also run BigQuery queries directly against CSV files in Cloud Storage by creating a FederatedTable wrapper object. That is described here:

https://github.com/GoogleCloudPlatform/datalab/blob/master/content/datalab/tutorials/BigQuery/Using%20External%20Tables%20from%20BigQuery.ipynb

Graham Wheeler
  • but that requires specifying the path? I am just confused as to where this csv file I upload to storage 'lives' – vvv Jan 12 '16 at 03:58

I uploaded my Iris.csv to my datalab root directory.

Then, as you mentioned in your question, I ran the following code cell.

os.getcwd()

I got '/content/datalab/docs'

Then I ran the following code cell.

import pandas as pd

iris = pd.read_csv('/content/datalab/Iris.csv')
print(iris)

It worked for me.
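The key point is that the path passed to pd.read_csv is a path inside the Datalab container, not on your laptop. A runnable illustration of the same pattern, using a throwaway CSV written to a temporary directory (since /content/datalab only exists inside Datalab):

```python
import os
import tempfile
import pandas as pd

# Write a tiny stand-in for Iris.csv; inside Datalab the uploaded file
# would live under /content/... instead of a temp directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "Iris.csv")
with open(path, "w") as f:
    f.write("sepal_length,species\n5.1,setosa\n6.3,virginica\n")

# Read by absolute path, exactly as in the answer above
iris = pd.read_csv(path)
print(len(iris))  # 2
```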

Kartik Podugu