
I'm trying to move data from a Data Catalog table (MySQL) through AWS Glue (in the visual mode of Glue Studio) into a Snowflake table.

For this, I'm following the guide Performing data transformations using Snowflake and AWS Glue [1].

I'm following every part of it, but when I execute my job, I get the error:

An error occurred while calling xxx.pyWriteDynamicFrame. No suitable driver

(Yes, other than the stack trace, there's no more info in the error message.) I've tested everything I could think of, such as:

  • Having access to the S3 bucket where the driver is stored
  • Having network access
  • Trying an incomplete JDBC URL (no password), and the error says so
  • Trying the wrong password, and I get the appropriate error

One thing I've noticed is that a lot of the reported issues are about AWS Glue used as a script (not the visual editor), and in many of them two jars are used: the Snowflake JDBC driver and the Snowflake Spark connector. Even though the tutorial I followed [1] isn't clear about this, I tried having both files in my 'drivers' bucket, but I still get the same error.
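
For reference, the script-based examples I've found roughly do the following. This is a sketch, not my job's code: the DynamicFrame name and the option values are placeholders, and the option names are the ones documented for the Snowflake Spark connector.

# Sketch of the DataFrame-based write that the script examples use (Snowflake Spark connector).
# "my_dynamic_frame" stands in for whatever DynamicFrame the job produces; all values are placeholders.
sf_options = {
    "sfURL": "xxx00000.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<wh>",
}

df = my_dynamic_frame.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
df.write \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "myschema.myTargetTable") \
    .mode("append") \
    .save()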

I've tried many versions of both files to no avail.

So I have no idea what the problem might be (I even tried on a different AWS account and a different Snowflake account so I could have full access to resources).

Have any of you tried this setup?

I'm using:

  • AWS Glue Studio (June 2021)
  • Snowflake cloud version 5.22.1 (obtained with SELECT CURRENT_VERSION(); in Snowflake)
  • Snowflake JDBC driver v3.13.4
  • Snowflake Spark connector v2.9.0-spark_2.4 for Scala v2.12
  • My connection string: jdbc:snowflake://xxx00000.snowflakecomputing.com/?user=${Username}&password=${Password}&warehouse=${wh}&db=${db}&schema=${schema}

[Edit Jun 23] Regarding @jonlegend's comment: I'm using the visual editor for this job, so I'm not in control of the code implementation. Nevertheless, here is the generated code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "mydb_name", table_name = "mytable_name", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydb_name", table_name = "mytable_name", transformation_ctx = "DataSource0")
## @type: ApplyMapping
## @args: [mappings = [("account_number", "string", "account_number", "string"), ("user_id", "int", "user_id", "int"), ("description", "string", "description", "string"), ("id", "int", "id", "int"), ("group_account_id", "int", "group_account_id", "int"), ("updated", "timestamp", "updated", "timestamp")], transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [frame = DataSource0]
Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("account_number", "string", "account_number", "string"), ("user_id", "int", "user_id", "int"), ("description", "string", "description", "string"), ("id", "int", "id", "int"), ("group_account_id", "int", "group_account_id", "int"), ("updated", "timestamp", "updated", "timestamp")], transformation_ctx = "Transform0")
## @type: DataSink
## @args: [connection_type = "custom.jdbc", connection_options = {"dbTable":"myschema.myTargetTable","connectionName":"snowflake-connection-v7"}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = Transform0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "custom.jdbc", connection_options = {"dbTable":"myschema.myTargetTable","connectionName":"snowflake-connection-v7"}, transformation_ctx = "DataSink0")
job.commit()

Also, here is the stack trace:

ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/test_job_v7.py", line 30, in <module>
    DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "custom.jdbc", connection_options = 
{
    "dbTable": "myschema.myTargetTable",
    "connectionName": "snowflake-connection-v7"
}
, transformation_ctx = "DataSink0")
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 653, in from_options
    format_options, transformation_ctx)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 281, in write_dynamic_frame_from_options
    format, format_options, transformation_ctx)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 304, in write_from_options
    return sink.write(frame_or_dfc)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/data_sink.py", line 35, in write
    return self.writeFrame(dynamic_frame_or_dfc, info)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/data_sink.py", line 31, in writeFrame
    return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors")
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o104.pyWriteDynamicFrame.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
    at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
    at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
    at com.amazonaws.services.glue.marketplace.partner.PartnerJDBCRecordWriterFactory.<init>(PartnerJDBCDataSink.scala:78)
    at com.amazonaws.services.glue.marketplace.partner.PartnerJDBCDataSink.createWriterFactory(PartnerJDBCDataSink.scala:32)
    at com.amazonaws.services.glue.marketplace.partner.PartnerJDBCDataSink.createWriterFactory(PartnerJDBCDataSink.scala:23)
    at com.amazonaws.services.glue.marketplace.connector.GlueCustomDataSink.defaultWriteDynamicFrame(CustomDataSink.scala:68)
    at com.amazonaws.services.glue.marketplace.connector.GlueCustomDataSink.writeDynamicFrame(CustomDataSink.scala:61)
    at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Finally, I'm looking at the error logs AND the Job logs, and the error is the same. Previous messages in both logs don't help either.

Daniel Salcedo
  • Review the following article, which gives a sample implementation for AWS Glue + Snowflake: https://www.snowflake.com/blog/how-to-use-aws-glue-with-snowflake/ – Srinath Menon Jun 25 '21 at 04:21
  • @Srinath I've already tried that method. It uses DataFrames, but the code generated by Glue Studio uses DynamicFrames for loading the data from the catalog. – Daniel Salcedo Jun 28 '21 at 12:14

1 Answer


A few suggestions:

Ensure your JDBC driver is referenced in your code. I am not sure how to do this in the visual editor, but in the code, change the following line:

DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "custom.jdbc", connection_options = {"dbTable":"myschema.myTargetTable","connectionName":"snowflake-connection-v7"}, transformation_ctx = "DataSink0")

to:

DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "custom.jdbc", connection_options = { 
  "dbTable":"myschema.myTargetTable",
  "connectionName":"snowflake-connection-v7",
  "customJdbcDriverS3Path": "Amazon S3 path of the custom JDBC driver",
  "customJdbcDriverClassName":"class name of the driver"
}, transformation_ctx = "DataSink0")

Also, ensure that your Glue job has IAM permissions on the S3 folder where the driver is located. If you are using the default service-role/AWSGlueServiceRole, just ensure that the string "aws-glue-" appears somewhere in the S3 path, e.g. "s3://somebucket/aws-glue-drivers/mydriver.jar".
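
For illustration, the two extra options might look like this (the S3 path below is hypothetical; the class name is the standard one shipped in the Snowflake JDBC jar):

connection_options = {
    "dbTable": "myschema.myTargetTable",
    "connectionName": "snowflake-connection-v7",
    # Hypothetical location; point this at wherever your snowflake-jdbc jar actually lives
    "customJdbcDriverS3Path": "s3://somebucket/aws-glue-drivers/snowflake-jdbc-3.13.4.jar",
    # Driver class packaged in the Snowflake JDBC jar
    "customJdbcDriverClassName": "net.snowflake.client.jdbc.SnowflakeDriver"
}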

jonlegend
  • Thanks for the answer! Unfortunately, it didn't work. I'm getting the same error with both the driver reference and the folder name in the bucket. I've tried other scripts and found that the data structure used by Glue Studio is a DynamicFrame, while other tutorials use DataFrames. That's what I'm looking into right now, but still no success. – Daniel Salcedo Jun 28 '21 at 13:13