
I have a Cloud SQL instance storing data in a database, and I have checked the option for this Cloud SQL instance to block all unencrypted connections. When I select this option, I am given three SSL-related files - a server CA certificate, a client certificate, and a client private key (three separate .pem files) (link to relevant CloudSQL+SSL documentation). These files are used to establish encrypted connections to the Cloud SQL instance.

I'm able to successfully connect to the Cloud SQL instance from the command line with the mysql client, using the --ssl-ca, --ssl-cert, and --ssl-key options to specify the server CA certificate, client certificate, and client private key, respectively:

mysql -uroot -p -h <host-ip-address> \
--ssl-ca=server-ca.pem \
--ssl-cert=client-cert.pem \
--ssl-key=client-key.pem

I am now trying to run a PySpark job that connects to this Cloud SQL instance to extract the data for analysis. The PySpark job is basically the same as this example provided by the Google Cloud training team. On line 39 of that script, a JDBC connection is made to the Cloud SQL instance:

jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://%s:3306/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)

but this does not make an encrypted connection and does not provide the three certificate files. If I have unencrypted connections to the Cloud SQL instance disabled, I see the following error message:

17/09/21 06:23:21 INFO org.spark_project.jetty.util.log: Logging initialized @5353ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
17/09/21 06:23:21 INFO org.spark_project.jetty.server.Server: Started @5426ms
17/09/21 06:23:21 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@74af54ac{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

[...snip...]

py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: java.sql.SQLException: Access denied for user 'root'@'<cloud-sql-instance-ip>' (using password: YES)

whereas if I have unencrypted connections to the Cloud SQL instance enabled, the job runs just fine. (This indicates that the issue is not with Cloud SQL API permissions - the cluster I'm running the PySpark job from definitely has permission to access the Cloud SQL instance.)
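For context, that connection string is then consumed by Spark's JDBC data source further down the linked script, roughly like this (the table name is just a placeholder):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# jdbcDriver and jdbcUrl are the variables shown above; 'Rating' is a placeholder table name.
dfRates = sqlContext.read \
    .format('jdbc') \
    .options(driver=jdbcDriver, url=jdbcUrl, dbtable='Rating') \
    .load()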

The JDBC connection strings I have found involving SSL add a &useSSL=true or &encrypt=true but do not point to external certificates; or, they use a keystore in some kind of Java-specific procedure. How can I modify the JDBC connection string from the Python script linked to above, in order to point JDBC (via PySpark) to the locations of the server certificate and client public/private keys (server-ca.pem, client-cert.pem, and client-key.pem) on disk?
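For concreteness, the keystore route I'd like to avoid seems to mean converting the .pem files into Java keystores (with openssl/keytool) and then passing Connector/J SSL properties in the URL, something like the sketch below. The keystore paths and passwords are placeholders and I haven't verified this from PySpark - the keystores would also have to be present on every worker node, which is part of why I'd rather avoid it:

# Keystore-based sketch (untested from PySpark); assumes the .pem files were
# first converted into Java keystores on each node, e.g.:
#   openssl pkcs12 -export -in client-cert.pem -inkey client-key.pem -out client.p12
#   keytool -importkeystore -srckeystore client.p12 -srcstoretype PKCS12 -destkeystore client-keystore.jks
#   keytool -importcert -alias cloudsql-ca -file server-ca.pem -keystore truststore.jks
# Keystore paths and passwords below are placeholders.
jdbcUrl = (
    'jdbc:mysql://%s:3306/%s?user=%s&password=%s'
    '&useSSL=true&requireSSL=true'
    '&trustCertificateKeyStoreUrl=file:///path/to/truststore.jks'
    '&trustCertificateKeyStorePassword=<truststore-password>'
    '&clientCertificateKeyStoreUrl=file:///path/to/client-keystore.jks'
    '&clientCertificateKeyStorePassword=<keystore-password>'
) % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)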

charlesreid1
  • Have you looked at https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory? It tunnels the entire connection using ephemeral SSL certs and avoids the need to manually manage certs. – Vadim Sep 21 '17 at 17:36
  • PySpark is using an [SQL context](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/CPB100/lab3b/sparkml/train_and_apply.py#L25) from the pyspark library - the library you linked to is Java, so I wouldn't be able to use it. But the idea is good. From what I can tell, [Cloud SQL Proxy](https://cloud.google.com/sql/docs/mysql/connect-admin-proxy) is doing something like what this library does, but with a command line utility. – charlesreid1 Sep 21 '17 at 18:57
  • As far as I can tell the connection happens using Java even if you are using python to write your jobs so the link I gave you still applies. Also see: https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark – Vadim Sep 21 '17 at 19:00
  • Thanks for that last link. I see where jars can be specified when running a Dataproc PySpark job, and how I would modify the JDBC connection string to specify using the socket factory. But what is the jar in this case - just cloud-sql-jdbc-socket-factory library built/bundled into a jar? – charlesreid1 Sep 21 '17 at 19:23

1 Answer


There's a handy initialization action (cloud-sql-proxy) for configuring the CloudSQL Proxy on Dataproc clusters. By default it assumes you intend to use CloudSQL for the Hive metastore, but if you download it, customize it to set ENABLE_CLOUD_SQL_METASTORE=0, and re-upload it into your own bucket to use as your custom initialization action, you should automatically get the CloudSQL Proxy installed on all of your nodes. Then you just point your MySQL connection string at localhost instead of the real CloudSQL IP.
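With the proxy listening on each node, the only change needed in the PySpark script is the connection string, along these lines (a sketch; port 3306 assumes you keep the default port assignment described below):

# The Cloud SQL Proxy listens on localhost and handles encryption and
# authentication to the real CloudSQL instance, so no certs in the URL.
jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://127.0.0.1:3306/%s?user=%s&password=%s' % (
    CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)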

When specifying the metadata flags, since you've disabled the metastore you'll use additional-cloud-sql-instances instead of hive-metastore-instance in your metadata:

--metadata "additional-cloud-sql-instances=<PROJECT_ID>:<REGION>:<ANOTHER_INSTANCE_NAME>=tcp<PORT_#>`

In this case you can optionally use the same port assignment the script would've used by default for the metastore, which is port 3306.

Dennis Huo
  • Great, this was exactly what I was looking for. One remaining question - who supplies the credentials? Are the credentials from the Dataproc instances (assuming they have permission to access the Cloud SQL API) automatically used? – charlesreid1 Sep 21 '17 at 22:36
  • Correct, the VM-mediated service-account credentials are used by default automatically, so you need to make sure you enabled the CloudSQL scopes on the cluster. In theory it's also possible to use separate credential keyfiles, but that's not too well explored especially for that init action. – Dennis Huo Sep 22 '17 at 01:17
  • No luck getting this working - the ADDITIONAL_INSTANCES variable is always empty, so the [test on line 63](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/cloud-sql-proxy/cloud-sql-proxy.sh#L63) always fails. What is ADDITIONAL_INSTANCES supposed to contain, and why is it required, if the Cloud SQL instance information is already provided in METASTORE_INSTANCE? – charlesreid1 Sep 23 '17 at 01:47
  • Ah sorry forgot that detail; we should add an example on the README; ADDITIONAL_INSTANCES is mostly just the same thing you would've put into METASTORE_INSTANCE, except that it can be a list. The distinction is just that the author of that script felt it would be a misnomer to say "METASTORE_INSTANCE" for something that isn't going to be used as a Hive Metastore. So to answer, simply leave METASTORE_INSTANCE blank, and specify your target CloudSQL instance inside ADDITIONAL_INSTANCES in the form `additional-cloud-sql-instances=<PROJECT_ID>:<REGION>:<INSTANCE_NAME>=tcp:<PORT_#>` – Dennis Huo Sep 23 '17 at 03:25
  • Updated the answer to include this info – Dennis Huo Sep 23 '17 at 03:29