How to specify multiple dependencies using --packages for spark-submit?

Question

I use the following command line to start a Spark Streaming job.

    spark-submit --class com.biz.test \
            --packages \
                org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
                org.apache.hbase:hbase-common:1.0.0 \
                org.apache.hbase:hbase-client:1.0.0 \
                org.apache.hbase:hbase-server:1.0.0 \
                org.json4s:json4s-jackson:3.2.11 \
            ./test-spark_2.10-1.0.8.jar \
            >spark_log 2>&1 &

The job fails to start with the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
    at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
    at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
    at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.

According to the docs (spark-submit --help):

The format for the coordinates should be groupId:artifactId:version.

So what I have should be valid and should reference this package.

If it helps, I'm running Cloudera 5.4.4.

What am I doing wrong? How can I reference the hbase packages correctly?

Is it working fine? In my case I had to add jars via --jars and --driver-class-path also. — Thomas Decaux, Oct 20 '16 at 17:49

zero323 · Accepted Answer · 2016-01-28T17:56:26.113

62

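A list of packages should be separated by commas with no whitespace. Breaking the line with a backslash should work just fine, as long as the continuation line does not begin with whitespace (the shell would treat that as the start of a new argument). For example:

--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0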
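
If the command is assembled from a script, one way to avoid stray whitespace is to join the coordinates programmatically. The following is only a sketch: `packages_arg` is a hypothetical helper (not part of Spark), the coordinates and jar path are copied from the question, and the three-part check is a simplification of the `groupId:artifactId:version` format.

```python
import shlex

# Maven coordinates taken from the question (groupId:artifactId:version).
packages = [
    "org.apache.spark:spark-streaming-kafka_2.10:1.3.0",
    "org.apache.hbase:hbase-common:1.0.0",
    "org.apache.hbase:hbase-client:1.0.0",
    "org.apache.hbase:hbase-server:1.0.0",
    "org.json4s:json4s-jackson:3.2.11",
]


def packages_arg(coords):
    """Hypothetical helper: join coordinates into one comma-separated token."""
    for c in coords:
        # Reject anything containing whitespace or not made of exactly three parts.
        if any(ch.isspace() for ch in c) or c.count(":") != 2:
            raise ValueError(f"not a groupId:artifactId:version coordinate: {c!r}")
    return ",".join(coords)


# The whole package list is passed as a single argument after --packages.
command = [
    "spark-submit",
    "--class", "com.biz.test",
    "--packages", packages_arg(packages),
    "./test-spark_2.10-1.0.8.jar",
]

# Print a copy-pasteable, single-line shell command.
print(shlex.join(command))
```

The list could also be handed to `subprocess.run(command)` directly, which sidesteps shell word splitting altogether.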

edited Jan 28 '16 at 17:56

answered Nov 25 '15 at 23:15

zero323


9

I found I also had to remove the spaces and line breaks in order to get it to work successfully: `--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,org.apache.hbase:hbase-common:1.0.0`... – davidpricedev Nov 25 '15 at 23:33

score 5 · Answer 2 · answered Oct 14 '20 at 09:47


I found it worthwhile to set the packages through SparkSession in Spark 3.0.0, here for the MySQL and PostgreSQL drivers:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysql-postgres').config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16').getOrCreate()
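
The same comma-separated rule applies to the `spark.jars.packages` value, so with more than a couple of drivers it may be easier to build the string from a list. This is a minimal sketch, not a definitive implementation: `jars_packages` and `build_session` are hypothetical helper names, and the coordinates are the ones used above. As far as I know, the setting only takes effect if it is in place before the first session is created, since `getOrCreate()` returns an already running session unchanged.

```python
def jars_packages(*coords):
    """Hypothetical helper: build the comma-separated spark.jars.packages value."""
    # Strip accidental surrounding whitespace from each coordinate before joining.
    return ",".join(c.strip() for c in coords)


def build_session(app_name, *coords):
    """Hypothetical helper: create a session with the given packages."""
    # Imported lazily so jars_packages() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", jars_packages(*coords))
        .getOrCreate()
    )


MYSQL = "mysql:mysql-connector-java:8.0.20"
POSTGRES = "org.postgresql:postgresql:42.2.16"

# Show the value that would be passed to spark.jars.packages.
print(jars_packages(MYSQL, POSTGRES))
```

With pyspark installed, `spark = build_session('mysql-postgres', MYSQL, POSTGRES)` would then correspond to the one-liner above.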

answered Oct 14 '20 at 09:47

Mohammad Aqajani


I had never heard of `spark.jars.packages` before, and I was a top-end developer (including multiple `spark-sql` and `mllib` contributions) from 2014 to 2019 – WestCoastProjects Dec 06 '22 at 06:08

score 1 · Answer 3 · edited Sep 04 '21 at 08:51


@Mohammad thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # ...
    .appName('myapp')
    # Add the Kafka and MySQL packages
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
    .getOrCreate()
)

edited Sep 04 '21 at 08:51

ouflak


answered Sep 04 '21 at 04:44

Prasanna Josium

