
Here is what I tried (IPython notebook, with Python 2.7):

import gcp
import gcp.storage as storage
import gcp.bigquery as bq
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

sample_bucket_name = gcp.Context.default().project_id + '-datalab'
sample_bucket_path = 'gs://' + sample_bucket_name 
sample_bucket_object = sample_bucket_path + '/myFile.csv'
sample_bucket = storage.Bucket(sample_bucket_name)
df = bq.Query(sample_bucket_object).to_dataframe()

This fails. Do you have any leads on what I am doing wrong?

Cy Bu

3 Answers


Based on the datalab source code, bq.Query() is primarily used to execute BigQuery SQL queries, not to read files. To read a file from Google Cloud Storage (GCS), one potential solution is to use the datalab %gcs line magic to read the CSV from GCS into a local variable. Once you have the data in a variable, you can use pd.read_csv() to convert the CSV-formatted data into a pandas DataFrame. The following should work:

import pandas as pd
from StringIO import StringIO

# Read csv file from GCS into a variable
%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars

# Store in a pandas dataframe
df = pd.read_csv(StringIO(cars))

There is also a related Stack Overflow question at the following link: Reading in a file with Google datalab
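To see the second step in isolation, here is a minimal, runnable sketch of the StringIO-to-DataFrame conversion. The `cars` string below is a made-up stand-in for the text that `%gcs read` would place in the variable (this sketch uses the Python 3 `io` module; on the Python 2.7 kernel use `from StringIO import StringIO` as above):

```python
import pandas as pd
from io import StringIO

# Made-up stand-in for the CSV text that `%gcs read` would fetch into `cars`
cars = "make,model,year\nFord,Focus,2015\nHonda,Civic,2016\n"

# Wrap the string in a file-like object so read_csv can parse it
df = pd.read_csv(StringIO(cars))
print(df.shape)  # (2, 3)
```

The key point is that `pd.read_csv()` accepts any file-like object, so once the object's contents are in memory, no temporary file is needed.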

Anthonios Partheniou

In addition to @Flair's comments about %gcs, I got the following to work for the Python 3 kernel:

import pandas as pd
from io import BytesIO

%gcs read --object "gs://[BUCKET ID]/[FILE].csv" --variable csv_as_bytes

df = pd.read_csv(BytesIO(csv_as_bytes))
df.head()
Tony

You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.

Make sure you have Dask installed:

conda install dask              # with conda

pip install "dask[complete]"    # with pip (quoted to avoid shell globbing)

import dask.dataframe as dd  # import Dask's dataframe module

dataframe = dd.read_csv('gs://bucket/datafile.csv')  # read a single CSV from GCS

dataframe2 = dd.read_csv('gs://bucket/path/*.csv')  # read multiple CSV files via a glob pattern

This is all you need to load the data.

You can filter and manipulate data with Pandas syntax now.

dataframe['z'] = dataframe.x + dataframe.y  # lazy column arithmetic

dataframe_pd = dataframe.compute()  # materialize the result as an in-memory pandas DataFrame
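Because Dask deliberately mirrors the pandas API, the column arithmetic above can be checked with plain pandas. This sketch uses made-up data in place of the CSV contents; the only difference in Dask is that the computation is lazy until `.compute()` is called:

```python
import pandas as pd

# Made-up data standing in for the columns loaded from the CSV
dataframe = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Same expression as the Dask version; pandas evaluates it eagerly
dataframe['z'] = dataframe.x + dataframe.y

print(dataframe['z'].tolist())  # [11, 22, 33]
```

This is why the Dask route is convenient: code written and tested against pandas usually transfers with minimal changes.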