
I'm following this tutorial on data prediction using Cloud ML Engine with scikit-learn for GCP AI Platform. I tried to make an API call to BigQuery with:

def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  # pkgutil.get_data returns the file contents as bytes under Python 3
  privatekey = pkgutil.get_data('trainer', 'privatekey.json')
  print(privatekey[:200])
  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     private_key=privatekey)

but got the following error:

Traceback (most recent call last):
  [...]
TypeError: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 66, in <module>
    arguments['numTrees']
  File "/root/.local/lib/python3.7/site-packages/trainer/model.py", line 119, in train_and_evaluate
    train_df, eval_df = create_dataframes(frac)
  File "/root/.local/lib/python3.7/site-packages/trainer/model.py", line 95, in create_dataframes
    train_df = query_to_dataframe(train_query)
  File "/root/.local/lib/python3.7/site-packages/trainer/model.py", line 82, in query_to_dataframe
    private_key=privatekey)
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/gbq.py", line 149, in read_gbq
    credentials=credentials, verbose=verbose, private_key=private_key)
  File "/root/.local/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 846, in read_gbq
    dialect=dialect, auth_local_webserver=auth_local_webserver)
  File "/root/.local/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 184, in __init__
    self.credentials = self.get_credentials()
  File "/root/.local/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 193, in get_credentials
    return self.get_service_account_credentials()
  File "/root/.local/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 413, in get_service_account_credentials
    "Private key is missing or invalid. It should be service "
pandas_gbq.gbq.InvalidPrivateKeyFormat: Private key is missing or invalid. It should be service account private key JSON (file path or string contents) with at least two keys: 'client_email' and 'private_key'. Can be obtained from: https://console.developers.google.com/permissions/serviceaccounts

When the package runs in a local environment, the private key loads fine, but when it is submitted as an ml-engine training job the error above occurs. Note that the private key fails to load only when I use GCP RUNTIME_VERSION="1.15" and PYTHON_VERSION="3.7"; it loads with no problem when I use PYTHON_VERSION="2.7".

In case it's useful, the structure of my package is:

/babyweight
    - setup.py
    - trainer
        - __init__.py
        - model.py
        - privatekey.json
        - task.py
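
For context, privatekey.json only ends up inside the installed trainer package if it is declared as package data; a minimal setup.py along these lines is what I understand is needed for pkgutil.get_data to find it (this is a sketch, not my exact file):

from setuptools import setup, find_packages

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    # Ship privatekey.json inside the installed 'trainer' package so that
    # pkgutil.get_data('trainer', 'privatekey.json') can locate it on the worker.
    package_data={'trainer': ['privatekey.json']},
)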

I'm not sure if the problem is due to a bug in Python, or where I placed privatekey.json.

  • 1) Convert the string to bytes: `private_key=privatekey.encode('utf-8'))`. 2) What is privatekey.json? Is this a Google Service account JSON key file or something you created? – John Hanley Jun 04 '20 at 23:05
  • I agree with @JohnHanley. To solve the bytes problem you need to encode it as utf-8. Why do you need to use this private_key attribute? Maybe you can use the credentials attribute – rmesteves Jun 05 '20 at 11:40
  • @JohnHanley yes, privatekey.json is a Google service account JSON key file used to make calls to the BigQuery API. – mlo Jun 07 '20 at 06:20
  • @rmesteves following your suggestion I changed from `private_keys` to `credentials` attribute as shown [here](https://pandas-gbq.readthedocs.io/en/latest/howto/authentication.html), and then build an absolute path to privatekey.json as shown [here](https://stackoverflow.com/questions/40416072/reading-file-using-relative-path-in-python-project/40416154). Now the job is able to run without error. – mlo Jun 07 '20 at 06:22
  • @mlo Feel free to summarize this as an answer, or let me know if I can do that – rmesteves Jun 08 '20 at 11:01

1 Answer


I was able to solve the problem after I changed read_gbq's argument for the BigQuery access key from private_key to credentials, as recommended by @rmesteves and as shown in the pandas-gbq authentication docs (https://pandas-gbq.readthedocs.io/en/latest/howto/authentication.html). I then built an absolute path to privatekey.json (as shown here: https://stackoverflow.com/questions/40416072/reading-file-using-relative-path-in-python-project/40416154) and loaded the key from there. Now the job runs without error.
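
Roughly, the working call now looks like the sketch below. Treat it as illustrative: the credentials handling follows the pandas-gbq docs linked above, and PROJECT is the same project-id variable as in the question.

def query_to_dataframe(query):
  import os
  import pandas as pd
  from google.oauth2 import service_account

  # Build an absolute path to the key file that sits next to this module, so it
  # resolves correctly once the package is installed on the training worker.
  keypath = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'privatekey.json')
  credentials = service_account.Credentials.from_service_account_file(keypath)

  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     credentials=credentials)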

Note: I only encountered this problem with Python 3+, but not with Python 2.7. I'm not sure why. It could possibly be due to the implementation of read_gbq.
