
I'm trying to read data from a MySQL database (hosted in GCP) and write it to a GCS bucket using the Apache Beam Python SDK. Below is the code I've written.

from __future__ import generators 
import apache_beam as beam
import time
import jaydebeapi 
import os
import argparse
from google.cloud import bigquery
import logging
import sys
from string import lower
from google.cloud import storage as gstorage
import pandas as pd
from oauth2client.client import GoogleCredentials

print("Import Successful")

class setenv(beam.DoFn): 
      def process(self,context):
          os.system('gsutil cp gs://<MY_BUCKET>/mysql-connector-java-5.1.45.jar /tmp/')
          logging.info('Enviornment Variable set.')
        
class readfromdatabase(beam.DoFn): 
      def process(self, context):
          logging.info('inside readfromdatabase')
          database_user='root'
          database_password='<DB_PASSWORD>'
          database_host='<DB_HOST>'
          database_port='<DB_PORT>'
          database_db='<DB_NAME>'
          logging.info('reached readfromdatabase 1')        
          jclassname = "com.mysql.jdbc.Driver"
          url = ("jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(database_host,database_port, database_db,database_user,database_password))
          jars = ["/tmp/mysql-connector-java-5.1.45.jar"]
          libs = None
          cnx = jaydebeapi.connect(jclassname, url, jars=jars,
                            libs=libs)   
          logging.info('Connection Successful..') 
          cursor = cnx.cursor()
          logging.info('Reading Sql Query from the file...')
          query = 'select * from employees.employee_details'
          logging.info('Query is %s',query)
          logging.info('Query submitted to Database..')

          for chunk in pd.read_sql(query, cnx, coerce_float=True, params=None, parse_dates=None, columns=None,chunksize=500000):
                 chunk.apply(lambda x: x.replace(u'\r', u' ').replace(u'\n', u' ') if isinstance(x, str) or isinstance(x, unicode) else x).WriteToText('gs://first-bucket-arpan/output2/')
    
          logging.info("Load completed...")
          return list("1")      
          
def run():  
    try:
        
        pcoll = beam.Pipeline()
        dummy= pcoll | 'Initializing..' >> beam.Create(['1'])
        logging.info('inside run 1')
        dummy_env = dummy | 'Setting up Instance..' >> beam.ParDo(setenv())
        logging.info('inside run 2')
        readrecords=(dummy_env | 'Processing' >>  beam.ParDo(readfromdatabase()))
        logging.info('inside run 3')
        p=pcoll.run()
        logging.info('inside run 4')
        p.wait_until_finish()
    except:
        logging.exception('Failed to launch datapipeline')
        raise 

def main():
    logging.getLogger().setLevel(logging.INFO)
    GOOGLE_APPLICATION_CREDENTIALS="gs://<MY_BUCKET>/My First Project-800a97e1fe65.json"
    run()

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  
  main()

I'm running it with the following command:

    python /home/aarpan_roy/readfromdatabase.py --region $REGION --runner DataflowRunner --project $PROJECT --temp_location gs://$BUCKET/tmp

While running, I get the output below and no Dataflow job is created:

INFO:root:inside run 2
INFO:root:inside run 3
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function annotate_downstream_side_inputs at 0x7fc1ea8a27d0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function fix_side_input_pcoll_coders at 0x7fc1ea8a28c0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x7fc1ea8a2938> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function expand_sdf at 0x7fc1ea8a29b0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function expand_gbk at 0x7fc1ea8a2a28> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sink_flattens at 0x7fc1ea8a2b18> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function greedily_fuse at 0x7fc1ea8a2b90> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function read_to_impulse at 0x7fc1ea8a2c08> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function impulse_to_input at 0x7fc1ea8a2c80> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sort_stages at 0x7fc1ea8a2e60> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function setup_timer_mapping at 0x7fc1ea8a2de8> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function populate_data_channel_coders at 0x7fc1ea8a2ed8> ====================
INFO:apache_beam.runners.worker.statecache:Creating state cache with size 100
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x7fc1ea356ad0> for environment ref_Environment_default_environment_1 (beam:env:embedded_python:v1, '')
INFO:apache_beam.runners.portability.fn_api_runner.fn_runner:Running (ref_AppliedPTransform_Initializing../Impulse_3)+((ref_AppliedPTransform_Initializing../FlatMap(<lambda at core.py:2632>)_4)+((ref_AppliedPTransform_Initializing../Map(decode)_6)+((ref_AppliedPTransform_Setting up Instance.._7)+(ref_AppliedPTransform_Processing_8))))
Copying gs://<MY_BUCKET>/mysql-connector-java-5.1.45.jar...
- [1 files][976.4 KiB/976.4 KiB]
Operation completed over 1 objects/976.4 KiB.
INFO:root:Enviornment Variable set.
INFO:root:inside run 4
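
For what it's worth, I suspect the --runner DataflowRunner flag and the other options never reach the pipeline, because I create beam.Pipeline() without any options. My understanding (which may well be wrong) is that it would normally be wired up roughly like the sketch below, reusing my DoFns from above; the flags are just the ones from my command line:

    import argparse
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run(argv=None):
        # Hand the unparsed command-line flags (--runner, --project, --region,
        # --temp_location, ...) over to Beam so they actually take effect.
        parser = argparse.ArgumentParser()
        _, beam_args = parser.parse_known_args(argv)
        options = PipelineOptions(beam_args)

        # Leaving the "with" block runs the pipeline and waits for it to finish.
        with beam.Pipeline(options=options) as pcoll:
            (pcoll
             | 'Initializing..' >> beam.Create(['1'])
             | 'Setting up Instance..' >> beam.ParDo(setenv())
             | 'Processing' >> beam.ParDo(readfromdatabase()))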

Kindly help me resolve the issue, and also guide me on how to extract data from MySQL to a GCS bucket using the Apache Beam Python API.

Thanks in advance.

=================================================================================

Hello all, I changed the code a bit and I'm now running it from a shell script after exporting GOOGLE_APPLICATION_CREDENTIALS. But when it executes, I get the error below:

WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: File /home/aarpan_roy/script/dataflowerviceaccount.json (pointed by GOOGLE_APPLICATION_CREDENTIALS environment variable) does not exist!
Connecting anonymously.
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
ERROR:root:Failed to launch datapipeline

Below is the full log:


    aarpan_roy@my-dataproc-cluster-m:~/script/util$ sh -x dataflow_runner.sh
    + export GOOGLE_APPLICATION_CREDENTIALS=gs://<MY_BUCKET>/My First Project-800a97e1fe65.json
    + python /home/aarpan_roy/script/util/loadfromdatabase.py --config config.properties --productconfig cts.properties --env dev --sourcetable employee_details --sqlquery /home/aarpan_roy/script/sql/employee_details.sql --connectionprefix d3 --incrementaldate 1900-01-01
    /home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/__init__.py:82: UserWarning: You are using Apache Beam with Python 2. New releases of Apache Beam will soon support Python 3 only.
      'You are using Apache Beam with Python 2. '
    /home/aarpan_roy/script/util/
    INFO:root:Job Run Id is 22152
    INFO:root:Job Name is load-employee-details-20200820
    SELECT * FROM EMPLOYEE_DETAILS
    WARNING:apache_beam.options.pipeline_options_validator:Option --zone is deprecated. Please use --worker_zone instead.
    INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
    INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpEarTQ0', 'apache-beam==2.23.0', '--no-deps', '--no-binary', ':all:']
    INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI: dataflow_python_sdk.tar
    INFO:apache_beam.runners.portability.stager:Downloading binary distribution of the SDK from PyPi
    INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpEarTQ0', 'apache-beam==2.23.0', '--no-deps', '--only-binary', ':all:', '--python-version', '27', '--implementation', 'cp', '--abi', 'cp27mu', '--platform', 'manylinux1_x86_64']
    INFO:apache_beam.runners.portability.stager:Staging binary distribution of the SDK from PyPI: apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl
    WARNING:root:Make sure that locally built Python SDK docker image has Python 2.7 interpreter.
    INFO:root:Using Python SDK docker image: apache/beam_python2.7_sdk:2.23.0. If the image is not available at local, we will try to pull from hub.docker.com
    INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
    INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
    WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: File /home/aarpan_roy/script/dataflowerviceaccount.json (pointed by GOOGLE_APPLICATION_CREDENTIALS environment variable) does not exist!
    Connecting anonymously.
    INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pipeline.pb...
    INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
    INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
    INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pipeline.pb in 0 seconds.
    INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pickled_main_session...
    INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pickled_main_session in 0 seconds.
    INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/dataflow_python_sdk.tar...
    INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/dataflow_python_sdk.tar in 0 seconds.
    INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl...
    INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl in 0 seconds.
    WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
    WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
    ERROR:root:Failed to launch datapipeline
    Traceback (most recent call last)
    File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 105, in run
        p=pcoll.run()
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 521, in run
        allow_proto_holders=True).run(False)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 534, in run
        return self.runner.run_pipeline(self, self._options)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 586, in run_pipeline
        self.dataflow_client.create_job(self.job), self)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
        return fun(*args, **kwargs)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 681, in create_job
        return self.submit_job_description(job)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
        return fun(*args, **kwargs)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 748, in submit_job_description
        response = self._client.projects_locations_jobs.Create(request)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 667, in Create
        config, request, global_params=global_params)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
        return self.ProcessHttpResponse(method_config, http_response, request)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
        self.__ProcessHttpResponse(method_config, http_response, request))
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
        http_response, method_config=method_config, request=request)
    HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/turing-thought-277215/locations/asia-southeast1/jobs?alt=json>: response: <{'status': '403', 'content-length': '138', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 20 Aug 2020 09:42:47 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8', 'www-authenticate': 'Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/compute.readonly https://www.googleapis.com/auth/compute https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email email https://www.googleapis.com/auth/userinfo#email"'}>, content <{
      "error": {
        "code": 403,
        "message": "Request had insufficient authentication scopes.",
        "status": "PERMISSION_DENIED"
      }
    }
    >
    Traceback (most recent call last):
      File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 194, in <module>
        args.sourcetable)
      File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 142, in main
        run()  
      File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 105, in run
        p=pcoll.run()
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 521, in run
        allow_proto_holders=True).run(False)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 534, in run
        return self.runner.run_pipeline(self, self._options)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 586, in run_pipeline
        self.dataflow_client.create_job(self.job), self)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
        return fun(*args, **kwargs)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 681, in create_job
        return self.submit_job_description(job)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
        return fun(*args, **kwargs)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 748, in submit_job_description
        response = self._client.projects_locations_jobs.Create(request)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 667, in Create
        config, request, global_params=global_params)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
        return self.ProcessHttpResponse(method_config, http_response, request)
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
        self.__ProcessHttpResponse(method_config, http_response, request))
      File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
        http_response, method_config=method_config, request=request)
    apitools.base.py.exceptions.HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/turing-thought-277215/locations/asia-southeast1/jobs?alt=json>: response: <{'status': '403', 'content-length': '138', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 20 Aug 2020 09:42:47 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8', 'www-authenticate': 'Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/compute.readonly https://www.googleapis.com/auth/compute https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email email https://www.googleapis.com/auth/userinfo#email"'}>, content <{
      "error": {
        "code": 403,
        "message": "Request had insufficient authentication scopes.",
        "status": "PERMISSION_DENIED"
      }
    }
    >
    + echo 1
    1

I'm not able to understand where I'm making a mistake. Kindly help me resolve the issue. Thanks in advance.

  • Just an Update: – Arpan Roy Aug 20 '20 at 10:50
    According to the error message, it seems that the service account that you are using doesn't have sufficient permissions. Please verify that this service account has the [Dataflow.admin role](https://cloud.google.com/dataflow/docs/concepts/access-control#roles). This role is the Minimal role for creating and managing dataflow jobs. – Orlandog Aug 20 '20 at 15:22
  • At the same time, please provide the content of your shell script ‘dataflow_runner.sh’. For security reasons, please consider masking the sensitive information in your code. – Orlandog Aug 20 '20 at 15:23
  • Hi Orlandog, I gave the service account the Dataflow Admin and Dataflow Worker roles. I also enabled "Full access to all APIs" for that account, copied the service account JSON file to the home directory, and exported GOOGLE_APPLICATION_CREDENTIALS in a wrapper shell script. From that same shell script I'm executing the Python script. – Arpan Roy Aug 23 '20 at 18:04

2 Answers


I think there might be a spelling mistake in your credentials path:

/home/aarpan_roy/script/dataflowerviceaccount.json should be /home/aarpan_roy/script/dataflowserviceaccount.json. Can you check this?
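
Also, the warning in your log ("Unable to find default credentials to use ... Connecting anonymously") is consistent with that file not existing. A quick sanity check like the sketch below might help; the path is the corrected one from your log, everything else is just an assumption, and note that GOOGLE_APPLICATION_CREDENTIALS has to point at a key file on the local filesystem (a gs:// URL will not be read):

    import os

    # Assumed local path for the service account key, based on the log output.
    key_path = "/home/aarpan_roy/script/dataflowserviceaccount.json"

    # The auth libraries read this file from local disk, so a gs:// path will not work.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    print(os.path.isfile(key_path))  # should print True before launching the pipeline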

Jayadeep Jayaraman

There seem to be two issues here:

  1. On the one hand, there is the authentication issue. Please perform the test suggested by @Jayadeep to verify whether it is a problem with the name of the credentials file.

    At the same time, the credentials may not be handled correctly because you are running your code on a Dataproc instance, so I recommend testing your code in Cloud Shell.

  2. On the other hand, I found another post mentioning that connecting to Cloud SQL from Python is not as straightforward as it is from Java (which can use JdbcIO).

    I also found another post that describes a workaround for connecting to Cloud SQL, but using psycopg2 instead of jaydebeapi. I recommend you try it (a MySQL-oriented variant is sketched after the snippet below).

import psycopg2

# Connection parameters for the Cloud SQL (PostgreSQL) instance;
# replace each value with your own settings.
connection = psycopg2.connect(
    host=host,
    hostaddr=hostaddr,
    dbname=dbname,
    user=user,
    password=password,
    sslmode=sslmode,
    sslrootcert=sslrootcert,
    sslcert=sslcert,
    sslkey=sslkey
)
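
If it helps, since the source in the question is MySQL rather than PostgreSQL, the same idea should work with a MySQL driver. The sketch below is only an untested illustration of that; PyMySQL and the connection values are assumptions on my part, not something taken from the posts above:

    import pymysql

    # Placeholder connection values; replace them with the ones from your script.
    connection = pymysql.connect(
        host="<DB_HOST>",
        port=3306,
        user="root",
        password="<DB_PASSWORD>",
        database="employees",
    )

    # Simple usage example: run the query from the question and fetch the rows.
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM employee_details")
        rows = cursor.fetchall()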
Orlandog