I'm running PyCharm 2018.2 on a Mac and executing a pyspark program; Spark is installed in the project's virtualenv.
I need to use external jars (specifically, the AWS S3 jars) in my pyspark script, so I declare the Maven dependency like this:
import os

from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created so spark-submit picks it up
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

conf = SparkConf() \
    .setMaster("local[2]") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
On my home network, this works great.
On my corporate network, however, there is an SSL inspector between me and the internet that swaps out the SSL certificates on HTTPS requests to Maven Central, which results in the following error:
Server access error at url https://repo1.maven.org/maven2/joda-time/joda-time/maven-metadata.xml (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
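As a sanity check (my own diagnostic, not part of the pyspark script), I can see the swapped certificate directly from Python; the hostname is just the repo from the error above:

import ssl

# Fetch the leaf certificate actually presented for repo1.maven.org;
# verification is skipped here, so this works even though the cert is
# untrusted. On the corporate network this prints the inspector's cert.
pem = ssl.get_server_certificate(('repo1.maven.org', 443))
print(pem)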
I know this is because the certificates the SSL inspector uses to re-sign the HTTPS responses are not trusted by the JVM that pyspark launches.
I have .cer copies of the intermediate certificates that are signing the HTTPS response.
Which JVM is being used in this specific case (Python/pyspark running in PyCharm), and how can I import these certificates into that JVM's trust store?
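For context, here is a sketch of how I've been trying to answer the first half myself; it assumes pyspark launches whatever java resolves from JAVA_HOME/PATH, which is exactly the part I'm not sure about:

import subprocess

# Ask the JVM on the PATH where it lives; java prints these settings
# (including java.home) to stderr, not stdout.
result = subprocess.run(
    ['java', '-XshowSettings:properties', '-version'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
)
print(result.stderr)

# If that java.home is the right JVM, I'd expect to import each .cer with
# keytool (paths and alias are my guesses; 'changeit' is the default password):
#   keytool -import -trustcacerts -alias corp-inspector -file intermediate.cer \
#       -keystore <java.home>/lib/security/cacerts -storepass changeit
# (on Java 8 the keystore lives under <java.home>/jre/lib/security/cacerts)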