22

I would like to run spark-shell with an external package from behind a corporate proxy. Unfortunately, external packages passed via the --packages option are not resolved.

E.g., when running

bin/spark-shell --packages datastax:spark-cassandra-connector:1.5.0-s_2.10

the Cassandra connector package is not resolved (it gets stuck at the last line):

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]

After some time the connection times out containing error messages like this:

:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/datastax/spark-cassandra-connector/1.5.0-s_2.10/spark-cassandra-connector-1.5.0-s_2.10.pom (java.net.ConnectException: Connection timed out)

When I deactivate the VPN with the corporate proxy, the package is resolved and downloaded immediately.

What I have tried so far:

Exposing proxies as environment variables:

export http_proxy=<proxyHost>:<proxyPort>
export https_proxy=<proxyHost>:<proxyPort>
export JAVA_OPTS="-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>"
export ANT_OPTS="-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>"

Running spark-shell with extra java options:

bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>" --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>" --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10

Is there some other configuration option I am missing?


7 Answers

31

Found the correct settings:

bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <somePackage>

Both the http and https proxies have to be set as extra driver options. JAVA_OPTS does not seem to have any effect.
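
For example, with a hypothetical corporate proxy at proxy.example.com on port 8080 (host and port are placeholders), the command from the question would look like:

bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10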

    Do you know how to do this with spark-submit --master yarn --deploy-mode cluster? – angelcervera Dec 14 '16 at 09:15
  • For me on spark2-submit this worked '--driver-java-options "-Dhttps.proxyHost=httpproxy -Dhttps.proxyPort=80"' – A. Rabus Aug 23 '17 at 09:34
  • Hi @mtsz, I still have a problem. I have configured the command with one of the proxies listed on [this site](https://free-proxy-list.net/), but it doesn't solve my problem. I don't know what to do. Would you please guide me to solve my problem? [my stackoverflow question](https://stackoverflow.com/questions/57744072/how-to-run-scala-code-in-spark-container-using-docker) and [my github issue](https://github.com/sindbach/mongodb-spark-docker/issues/5) – Mostafa Ghadimi Sep 03 '19 at 11:53
  • @MostafaGhadimi were you able to resolve this issue? I am facing the same one. – abheet22 Feb 26 '20 at 12:58
  • @abheet22 I used some different jar files instead of the previous ones. – Mostafa Ghadimi Feb 26 '20 at 15:30
12

If the proxy is correctly configured on your OS, you can use the Java property java.net.useSystemProxies:

--conf "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true"

so that the proxy host/port and no-proxy hosts are picked up from the system configuration.
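
For example, a minimal invocation, keeping the package placeholder from the accepted answer:

bin/spark-shell --conf "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" --packages <somePackage>

This relies on the JVM being able to read the OS-level proxy settings, so it may not work on every platform.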

5

This worked for me with Spark 1.6.1:

bin\spark-shell --driver-java-options "-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <package>
2

I was struggling with pyspark until I found this.

Adding on to @Tao Huang's answer:

bin/pyspark --driver-java-options="-Dhttp.proxyUser=user -Dhttp.proxyPassword=password -Dhttps.proxyUser=user -Dhttps.proxyPassword=password -Dhttp.proxyHost=proxy -Dhttp.proxyPort=port -Dhttps.proxyHost=proxy -Dhttps.proxyPort=port" --packages [groupId:artifactId]

That is, it should be -Dhttp(s).proxyUser instead of ...proxyUsername

1

Adding

spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>

to $SPARK_HOME/conf/spark-defaults.conf works for me.
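
With that line in place, the shell can then be started without any proxy flags on the command line, for example:

bin/spark-shell --packages <somePackage>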

1

If you need authentication to use the proxy, you can add the following to the default conf file:

spark.driver.extraJavaOptions  -Dhttp.proxyHost= -Dhttp.proxyPort= -Dhttps.proxyHost= -Dhttps.proxyPort= -Dhttp.proxyUsername= -Dhttp.proxyPassword= -Dhttps.proxyUsername= -Dhttps.proxyPassword= 
0

On Windows 7 with spark-2.0.0-bin-hadoop2.7, I set spark.driver.extraJavaOptions in %SPARK_HOME%\spark-2.0.0-bin-hadoop2.7\conf\spark-defaults.conf like:

spark.driver.extraJavaOptions -Dhttp.proxyHost=hostname -Dhttp.proxyPort=port -Dhttps.proxyHost=host -Dhttps.proxyPort=port