
I have gotten a few notebooks up and running on Datalab. I'd like, for a variety of reasons, to access the same data from a local Jupyter notebook on my machine.

This question suggested a few approaches which so far I can't get working.

Specifically, the gcloud library:

from gcloud import bigquery
client = bigquery.Client()

gives me a stack trace, the last line of which is:

ContextualVersionConflict: (protobuf 2.6.1 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('protobuf!=3.0.0.b2.post1,>=3.0.0b2'), set(['gcloud']))

The Pandas library seems promising:

df = pd.io.gbq.read_gbq('SELECT CCS_Category_ICD9, Gender, Admit_Month FROM [xxxxxxxx-xxxxx:xxxx_100MB_newform.xxxxxx_100MB_newform] ORDER BY CCS_Category_ICD9',
                        project_id='xxxxxxxx-xxxxx')

Also gives me a stack trace:

IOError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/httplib2-0.9.1.dist-info/METADATA'

Perhaps I have an auth issue with the Pandas approach, although my browser is currently authorized to the project? Or am I missing a dependency?

Any suggestions or guidance appreciated.

What is the best way to access a BigQuery data source from within a local Jupyter notebook?

dartdog

4 Answers


Based on the error from gbq.read(), it appears that httplib2 may not be correctly installed. On the pandas installation page, there are a few optional dependencies which are required for Google BigQuery support (httplib2 is one of them). To reinstall/repair the installation, try:

pip install httplib2 --ignore-installed
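
If reinstalling httplib2 alone doesn't help, the other optional BigQuery dependencies listed on that page can be reinstalled the same way (the package names below are from the pandas docs of that era; adjust them to your pandas version):

pip install google-api-python-client oauth2client --ignore-installed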

Once the optional dependencies for Google BigQuery support are installed, the following code should work:

from pandas.io import gbq
df = gbq.read_gbq('SELECT * FROM MyDataset.MyTable', project_id='my-project-id')
Anthonios Partheniou
  • Darn, I wish that was it. Pip says it is installed and I also tried pip --upgrade, same result, already satisfied. :-( – dartdog May 17 '16 at 22:28
  • Does the --ignore-installed option help? – Anthonios Partheniou May 17 '16 at 22:33
  • But!! Removing the project id string I had in front of the table name got things going, so in short I had a very bad format on the Select clause! – dartdog May 17 '16 at 22:36
  • If you also want to provide a project ID inside the SQL string, try SELECT * FROM [project-id:MyDataset.MyTable] – Anthonios Partheniou May 17 '16 at 22:43
  • @AnthoniosPartheniou: I get the error "response too large to return". Is there a setting in read_gbq to overcome this error. Thanks. – user3447653 Dec 09 '16 at 16:48
  • @user3447653 Please open a new StackOverflow question. I can provide a more detailed answer. Short answer: you're returning too much data. If you really want a lot of data, and you want to use pandas, you would need [this pull request](https://github.com/pandas-dev/pandas/pull/14742) in order to set additional configuration options, namely `allowLargeResults` and `destinationTable` (see the sketch below). [This](http://stackoverflow.com/questions/20287938/error-response-too-large-to-return-in-big-query) StackOverflow link may also be helpful. Another option is to extract your results to Google Cloud Storage. – Anthonios Partheniou Dec 10 '16 at 15:59
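
For reference, in pandas versions that include that pull request, read_gbq accepts a configuration dictionary mirroring the BigQuery job REST API, so those options could be passed roughly like this (a sketch; the destination table names are placeholders, and allowLargeResults requires a destination table):

df = gbq.read_gbq(
    'SELECT * FROM MyDataset.MyTable',
    project_id='my-project-id',
    configuration={
        'query': {
            'allowLargeResults': True,
            'destinationTable': {
                'projectId': 'my-project-id',
                'datasetId': 'MyDataset',
                'tableId': 'MyLargeResults'
            }
        }
    })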

If you were using Datalab-specific ways of accessing GCP, then you may want to try using https://github.com/googledatalab/datalab instead. That will give you Datalab-compatible functionality within Jupyter Notebook.
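
For example, installing the datalab Python package (the library behind that project; this is a sketch, check the repository's README for the exact setup) lets you use the Datalab cell magics from a plain Jupyter notebook:

!pip install datalab
%load_ext google.datalab.kernel

Then a query cell:

%%bq query
SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10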

Graham Wheeler
  • And an easy-peasy `docker run -it -p "127.0.0.1:8081:8080" -v $PWD:"/content" gcr.io/cloud-datalab/datalab:local` will set up the environment without the need for a build. – nilleb Aug 11 '17 at 12:47

I had the same issue but managed to solve it by installing the conda version of gbq. I already had the Anaconda distribution of Python installed, so I guess there may be some link missing if you use pip.

conda install pandas-gbq --channel conda-forge

This command did the business.
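
Once pandas-gbq is installed, the read call is the same as in the accepted answer, for example (a sketch with placeholder dataset and project names):

import pandas as pd
df = pd.read_gbq('SELECT * FROM MyDataset.MyTable', project_id='my-project-id')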

TICH

I have one example here: https://github.com/kanjih-ciandt/docker-jupyter-gcloud/blob/master/ka.ipynb

But basically, you first need some packages installed:

!pip install google-cloud --user
!pip install --upgrade google-cloud-bigquery[pandas] --user
!pip install google-cloud-storage --user

If you already have a service account file, just execute this (replacing JSON_SERVICE_ACCOUNT_FILE):

from google.cloud import bigquery

# Create a client authenticated with the service account key file
client = bigquery.Client.from_service_account_json(JSON_SERVICE_ACCOUNT_FILE)
# Perform a query.
QUERY = (
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
    'WHERE state = "TX" '
    'LIMIT 100')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
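
Since the goal is to work with the data in a notebook, the same query job can also be loaded straight into a pandas DataFrame; this should work given the [pandas] extra installed above:

# Instead of iterating over rows, pull the query results into a DataFrame
df = query_job.to_dataframe()
df.head()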

But if you have access to some GCP project and don't know how to create a service account, you can create one directly in your Jupyter notebook:

SERVICE_ACCOUNT='jupytersa'
JSON_SERVICE_ACCOUNT_FILE = 'sa1.json'
GCP_PROJECT_ID='<GCP_PROJECT_ID>' 

import subprocess
import sys
import logging

logger = logging.Logger('catch_all')


def run_command(parameters):
    try:
        return subprocess.check_output(parameters)
    except BaseException as e:
        logger.error(e)
        logger.error('ERROR: look in the Jupyter console for more information')

run_command([
        'gcloud', 'iam', 'service-accounts',
        'create', SERVICE_ACCOUNT,
        '--display-name', "Service Account for BETA SCC API",
        '--project', GCP_PROJECT_ID
])


IAM_ROLES = [
    'roles/editor'
]

for role in IAM_ROLES:
    run_command([
        'gcloud', 'projects', 'add-iam-policy-binding', GCP_PROJECT_ID,
        '--member', 'serviceAccount:{}@{}.iam.gserviceaccount.com'.format(SERVICE_ACCOUNT, GCP_PROJECT_ID),
        '--quiet',  '--role', role
    ])


run_command([
        'gcloud', 'iam', 'service-accounts',
        'keys', 'create', JSON_SERVICE_ACCOUNT_FILE,
        '--iam-account', 
        '{}@{}.iam.gserviceaccount.com'.format(SERVICE_ACCOUNT, GCP_PROJECT_ID)
])
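
Once the key file exists, you can create the client exactly as in the first snippet, or point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at it so that any Google client library picks it up:

import os
from google.cloud import bigquery

# Let the client library discover the credentials via the standard env variable
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = JSON_SERVICE_ACCOUNT_FILE
client = bigquery.Client(project=GCP_PROJECT_ID)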

The full example can be found here: https://github.com/kanjih-ciandt/docker-jupyter-gcloud/blob/master/ka.ipynb

To conclude, if you want to execute this notebook from Docker, you can use this image: https://cloud.docker.com/u/hkanjih/repository/docker/hkanjih/docker-jupyter-gcloud

hkanjih