I'm running PyCharm 2018.2 on a Mac and executing a pyspark program. Spark was installed in the virtualenv.

I need to use external jars (specifically, the AWS S3 jars) in my pyspark script, so I use the following to declare the Maven dependency:

import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created so the JVM is launched with --packages
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
conf = SparkConf() \
    .setMaster("local[2]") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")

sc = SparkContext(conf=conf)

On my home network, this works great.

On my corporate network, there is an SSL inspector between me and the internet that swaps the SSL certificates on the HTTPS requests to Maven Central.

This results in the below error message:

Server access error at url https://repo1.maven.org/maven2/joda-time/joda-time/maven-metadata.xml (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)

I know this is because the SSL certificates that the SSL inspector signed the HTTPS response with are not trusted by the JVM executed by pyspark.

I have .cer copies of the intermediate certificates that are signing the HTTPS response.

Which JVM is being used in this specific case (python/pyspark running in PyCharm) and how can I update the certificates in that JVM's trust store?

Jared
  • The package uses your system Java, so if Maven works in your office environment, this should work too? – Tarun Lalwani May 09 '18 at 06:23
  • There are multiple JDKs installed on my local machine. Any idea how I can determine which one the package is using? – Jared May 09 '18 at 15:43
  • Yes. Run the program in a debugger, pause it after `sc = SparkContext(conf = conf)`, then run `ps aux | grep java` to see which JVM is in use. – Tarun Lalwani May 09 '18 at 15:47
  • Thanks for the help. That should've been obvious ;). Do you want to post a quick answer here so I can accept it and give you rep, or should I self-answer? – Jared May 09 '18 at 17:24

1 Answer

Run the program in a debugger and pause it after `sc = SparkContext(conf = conf)`. Then run `ps aux | grep java`; the path of the java executable in the output tells you which JVM is being used.
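If you would rather script that check than read the `ps` output by eye, something like the following may help. It is only a sketch: the function names are made up for illustration, it assumes the usual 11-column `ps aux` layout with the command in the last column, it assumes the java path contains no spaces, and it needs Python 3.7+ for `capture_output`.

```python
import subprocess


def find_java_executables(ps_output):
    """Return the executable path of every `ps aux` row that is running java."""
    executables = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumption: 11 columns, the last one holding the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        if executable == "java" or executable.endswith("/java"):
            executables.append(executable)
    return executables


def running_java_executables():
    """Hypothetical helper: run `ps aux` and filter it for java processes."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return find_java_executables(result.stdout)
```

Call `running_java_executables()` from a second terminal (or the debugger console) while the program is paused after the SparkContext is created. The JDK home is normally two directories above the `bin/java` path it returns.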

Then add your certificates to that JVM's trust store. Once you know the correct JDK, follow the steps in the question linked below to import them into its keystore:

How to properly import a selfsigned certificate into Java keystore that is available to all Java applications by default?
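As a rough sketch of what that import can look like, the snippet below builds a `keytool -importcert` command for the JDK found in the previous step. The function names, the alias and the certificate filename are placeholders, not anything from the linked question. It assumes the default trust store is the `cacerts` file under `jre/lib/security` (JDK 8) or `lib/security` (JDK 9+) and that the store password is still the default `changeit`. It only builds the command; running it usually requires `sudo`.

```python
from pathlib import Path


def cacerts_candidates(java_executable):
    """Possible locations of the default trust store for a given bin/java path."""
    java_home = Path(java_executable).parent.parent
    return [
        java_home / "jre" / "lib" / "security" / "cacerts",  # JDK 8 layout
        java_home / "lib" / "security" / "cacerts",          # JRE or JDK 9+ layout
    ]


def keytool_import_command(java_executable, cert_file, alias, storepass="changeit"):
    """Build (but do not run) a keytool command that trusts one certificate."""
    for store in cacerts_candidates(java_executable):
        if store.exists():
            break
    else:
        raise FileNotFoundError("no cacerts file found for " + str(java_executable))
    keytool = Path(java_executable).parent / "keytool"
    return [
        str(keytool), "-importcert", "-noprompt",
        "-alias", alias,
        "-file", str(cert_file),
        "-keystore", str(store),
        "-storepass", storepass,
    ]
```

For example, `keytool_import_command(java_path, "corp-intermediate.cer", "corp-intermediate")` returns a list you can pass to `subprocess.run`, or join with spaces and paste into a terminal. Repeat it once per certificate, each with its own alias.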

Tarun Lalwani
  • Once I had the correct JDK, this question shows the steps to import the certificates with keytool: https://stackoverflow.com/questions/11617210/how-to-properly-import-a-selfsigned-certificate-into-java-keystore-that-is-avail – Jared May 09 '18 at 18:14
  • Thanks for the link, updated the answer with the same. Hope your other query on token also got solved? – Tarun Lalwani May 09 '18 at 18:17