
I am trying to do something that is very simple in other data services: run a relatively simple SQL query and return the result as a dataframe in Python. I am on Windows 10 and using Python 2.7 (specifically Canopy 1.7.4).

Typically this would be done with pandas.read_sql_query, but due to some specifics of BigQuery it requires a different method, pandas.io.gbq.read_gbq

This method works fine unless the query result is genuinely big. If you run such a query on BigQuery, you get the error


GenericGBQException: Reason: responseTooLarge, Message: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors


This was asked and answered before in the question below, but neither of the solutions is relevant to my case:

Python BigQuery allowLargeResults with pandas.io.gbq

One solution is for Python 3, so it is a nonstarter. The other gives an error because I am unable to set my credentials as an environment variable in Windows:


ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.


I was able to download the JSON credentials file, and I have set it as an environment variable in the few ways I know how, but I still get the above error. Do I need to load it somehow in Python? It seems to be looking for the file but is unable to find it. Is there a special way to set it as an environment variable in this case?

Keith
  • I believe this is covered by this bug on the Pandas library: https://github.com/pydata/pandas-gbq/issues/15 – Tim Swast Jun 12 '17 at 21:20
  • Sort of. The credentials method used in pandas_gbq is quite different than what needs to be used for large queries. It would require a whole rewrite. This is essentially what the code in the linked ticket does. It uses read_gbq in the case that the query is not big and then uses the JSON credentials otherwise. I do not think you can do it without the JSON credentials and this is where I get stuck. – Keith Jun 12 '17 at 21:37
  • Ah, I see. You can explicitly create service account credentials and point it at the JSON file path rather than using default credentials. It differs depending on which library you use, but to do it with `google-auth` use https://google-auth.readthedocs.io/en/latest/reference/google.oauth2.service_account.html#google.oauth2.service_account.Credentials.from_service_account_file and use the resulting credentials object to create the `bigquery.Client(credentials=my_credentials)` (sketched below these comments) – Tim Swast Jun 12 '17 at 23:29
  • I did "pip install google-api-python-client" but I get "ImportError: No module named google.oauth2" when I call "from google.oauth2 import service_account". Is this the wrong package? – Keith Jun 12 '17 at 23:49
  • (edit: you are right for the module name) I think `google-auth` is an optional dependency of `google-api-python-client`, so you might have to explicitly install that, too. – Tim Swast Jun 12 '17 at 23:59
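A minimal sketch of what Tim Swast suggests in the comments above, assuming the google-cloud-bigquery client library is installed; the key-file path and project id are placeholders:

from google.oauth2 import service_account
from google.cloud import bigquery

# Build credentials directly from the downloaded JSON key file,
# bypassing the application-default lookup that fails here on Windows.
my_credentials = service_account.Credentials.from_service_account_file('path/file.json')

# Pass the credentials object explicitly when constructing the client.
client = bigquery.Client(project='<your project id>', credentials=my_credentials)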

3 Answers


You can do it in Python 2.7 by changing the default dialect from legacy to standard in the pd.read_gbq function.

pd.read_gbq(query, 'my-super-project', dialect='standard')

Indeed, the BigQuery documentation says this about the allowLargeResults parameter:

AllowLargeResults: For standard SQL queries, this flag is ignored and large results are always allowed.
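For example, a minimal sketch; the public dataset used here is only an illustration. Note that standard SQL references tables with backticks rather than the legacy brackets:

import pandas as pd

# Standard SQL table references use `project.dataset.table` backticks,
# not the legacy [project:dataset.table] brackets.
query = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
"""

df = pd.read_gbq(query, 'my-super-project', dialect='standard')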

nlassaux
  • Thanks so much! Just to follow up: you need to change how you reference the table; use ` (backtick) instead of [ – Keith Jun 27 '17 at 23:27

I have found two ways of directly importing the JSON credentials file. Both are based on the original answer to Python BigQuery allowLargeResults with pandas.io.gbq.

1) Credit to Tim Swast

First

pip install google-api-python-client
pip install google-auth
pip install google-cloud-core

then replace

credentials = GoogleCredentials.get_application_default()

in create_service() with

from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('path/file.json')
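For context, a sketch of what create_service() might look like after the swap, assuming the google-api-python-client discovery pattern from the linked answer:

from google.oauth2 import service_account
from googleapiclient.discovery import build

def create_service():
    # Load credentials from the JSON key file instead of the
    # application-default lookup that was failing on Windows.
    credentials = service_account.Credentials.from_service_account_file('path/file.json')
    # Hand them to the BigQuery v2 discovery client.
    return build('bigquery', 'v2', credentials=credentials)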

2)

Set the environment variable manually in your code, like so:

import os

# Point the application-default lookup at the downloaded JSON key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.expanduser('path/file.json')
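With the variable set, the default-credential lookup from the linked answer's create_service() works unchanged; a quick check, assuming the oauth2client library that raised the original error:

from oauth2client.client import GoogleCredentials

# Succeeds now, because GOOGLE_APPLICATION_CREDENTIALS points at the key file.
credentials = GoogleCredentials.get_application_default()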

I prefer method 2 since it does not require new modules to be installed and is also closer to the intended use of the JSON credentials.

Note:

You must create a destinationTable and pass that information to run_query()
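For reference, a sketch of the relevant fields in the BigQuery v2 REST job configuration; the destination ids are placeholders:

# Large results require an explicit destination table
# alongside allowLargeResults in the query job configuration.
job_data = {
    'configuration': {
        'query': {
            'query': '<insert your big query here>',
            'allowLargeResults': True,
            'destinationTable': {
                'projectId': '<your project id>',
                'datasetId': '<your dataset>',
                'tableId': '<your table>',
            },
        }
    }
}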

Keith

Here is code that works fully in Python 2.7 on Windows:

import pandas as pd
my_qry="<insert your big query here>"
### Put here the data from your service account credentials file - all fields are available there ###
my_file="""{
  "type": "service_account",
  "project_id": "cb4recs",
  "private_key_id": "<id>",
  "private_key": "<your private key>\n",
  "client_email": "<email>",
  "client_id": "<id>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "<x509 url>"
  }"""

df = pd.read_gbq(my_qry, project_id='<your project id>', private_key=my_file)
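If you would rather not paste the key contents into the script, pandas-gbq's private_key parameter also accepts a path to the JSON key file, so an equivalent call would be:

# Equivalent: point private_key at the key file instead of embedding its contents.
df = pd.read_gbq(my_qry, project_id='<your project id>', private_key='path/file.json')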

That's it :)

Roee Anuar