2

Trying to submit a sqoop job to dataproc to export data from a postgres database following the article: https://medium.com/google-cloud/migrate-oracle-data-to-bigquery-using-dataproc-and-sqoop-cd3863adde7b

It is erroring out with: org.postgresql.util.PSQLException: SSL error: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.

This is the command I am trying to submit (variables have been appropriately set):

gcloud dataproc jobs submit hadoop --cluster=sqoop-cluster --region=us-central1 --class=org.apache.sqoop.Sqoop --jars=$libs -- import -Dmapreduce.job.user.classpath.first=true -Dorg.apache.sqoop.splitter.all
ow_text_splitter=true --connect=$JDBC_STR --username=xxx --password=xxxx--driver=org.postgresql.Driver --target-dir=$STAGING_BUCKET/$TABLE --table=$SCHEMA.$TABLE --enclosed-by
 '\"' --escaped-by \" --fields-terminated-by '|' --null-string '' --null-non-string '' --as-textfile

The postgres jdbc connection string is as follows (omitting ssl=true throws hba_conf not found):

JDBC_STR=jdbc:postgresql://xxxxx:5432/YYYY?ssl=true        

The detailed error:

Job [63fb49544a1141f89f9a12960cc18e18] submitted.
Waiting for job output...
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNCLASSSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNCLASSSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNCLASSSHIM_OPTS: invalid variable name
2021-10-14 21:48:33,931 WARN tool.SqoopTool: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2021-10-14 21:48:34,128 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2021-10-14 21:48:34,156 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2021-10-14 21:48:34,176 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2021-10-14 21:48:34,203 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.Gene
ricJdbcManager. Please specify explicitly which connection manager should be used next time.
2021-10-14 21:48:34,217 INFO manager.SqlManager: Using default fetchSize of 1000
2021-10-14 21:48:34,217 INFO tool.CodeGenTool: Beginning code generation
2021-10-14 21:48:34,504 ERROR manager.SqlManager: Error executing statement: org.postgresql.util.PSQLException: SSL error: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.
org.postgresql.util.PSQLException: SSL error: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.
        at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:64)

Any help is appreciated.

Thanks!

sacoder
  • 159
  • 13

1 Answers1

1

Seems that your PostgreSQL server has SSL enabled, but the client side (Dataproc VMs) are not configured with the server certificate or its root CA.

  1. With ssl=true the client side will verify the server certificate, you can use a Dataproc init action to import the server certificate to Dataproc VMs:
gsutil cp gs://<my-bucket>/server.crt .

# If `JAVA_HOME` is not defined, try `/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64`.
keytool -keystore $JAVA_HOME/lib/security/cacerts -alias postgresql -import -file server.crt
  1. If you don't want to verify server certificate on the client side, instead you want the server to verify the client hostname/IP and certificate, configure your server, and use sslmode=require in the connection string.

  2. For a quick test with server certification verification disabled on the client side, try this in the JDBC connection string:

?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory

See this doc for more information on configuring SSL for PostgreSQL. Also a similar question for reference.

Dagang
  • 24,586
  • 26
  • 88
  • 133
  • This was very helpful and pointed me to the right place, I used the sslmode=require instead of ssl=true and that fixed the issue. I did not have to do anything else on dataproc. – sacoder Oct 15 '21 at 14:45
  • So basically you are disabling server certificate verification on the client side but verify the client certificate on the server side? – Dagang Oct 15 '21 at 16:51
  • Could you accept the answer if it is helpful, thanks! – Dagang Oct 15 '21 at 16:52