
We have a third-party jar file on which our Spark application depends. This jar file is ~15MB. Since we want to deploy our Spark application on a large-scale cluster (~500 workers), we are concerned about distributing this third-party jar file. According to the Apache Spark documentation (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management), the options for distributing the file include HDFS, an HTTP server, the driver's HTTP server, and local paths.

We would rather not use local paths, because that requires copying the jar file into every worker's Spark libs directory. On the other hand, if we use HDFS or an HTTP server, then when the Spark workers try to fetch the jar file, they may effectively mount a DoS attack against our Spark driver server. So, what is the best way to address this challenge?
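For reference, the documented schemes can all be expressed through spark-submit's `--jars` option; a sketch with illustrative paths (the application jar, class name, and all paths below are hypothetical):

```shell
# HDFS: every executor pulls the jar from HDFS at job start.
spark-submit --jars hdfs:///libs/thirdparty.jar --class com.example.App app.jar

# HTTP(S): every executor pulls the jar from the given web server.
spark-submit --jars https://repo.example.com/libs/thirdparty.jar --class com.example.App app.jar

# local: the jar is expected to already exist at this path on every node;
# nothing is transferred over the network at submit time.
spark-submit --jars local:/opt/spark/libs/thirdparty.jar --class com.example.App app.jar
```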

mrr

2 Answers


On the other hand, if we use HDFS or an HTTP server, then when the Spark workers try to fetch the jar file, they may effectively mount a DoS attack against our Spark driver server. So, what is the best way to address this challenge?

If you put the third-party jar in HDFS, why would it affect the Spark driver server? Each node should fetch the additional jar directly from HDFS, not from the Spark driver.
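A minimal sketch of the HDFS approach this answer describes (all paths and the replication factor are illustrative); raising the replication factor spreads the fetch load across more datanodes:

```shell
# Upload the third-party jar to HDFS once (path is hypothetical).
hdfs dfs -mkdir -p /libs
hdfs dfs -put thirdparty.jar /libs/

# Optionally raise the replication factor so hundreds of workers
# are not all reading from the same few datanodes.
hdfs dfs -setrep -w 50 /libs/thirdparty.jar

# Workers then fetch the jar directly from HDFS, not from the driver.
spark-submit --jars hdfs:///libs/thirdparty.jar --class com.example.App app.jar
```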

y. bs
  • Yes, you are right. But the problem is that with ~1000 workers, when the application is submitted, all the Spark workers try to fetch the jar from HDFS at once. If the replication factor is small, we again overwhelm the network of the data nodes that hold the replicas. – mrr Nov 30 '21 at 09:40
  • What about an uber jar? (https://stackoverflow.com/a/49811665/8927135) – y. bs Nov 30 '21 at 13:51
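The uber (fat) jar suggested in the comment bundles the third-party dependency into the application jar itself, so only one artifact is shipped; a sketch assuming an sbt project with the sbt-assembly plugin already configured (project and class names are hypothetical):

```shell
# Build a single self-contained jar that already contains the
# third-party dependency (assumes sbt-assembly is set up).
sbt assembly

# Submit the fat jar; no separate --jars distribution is needed.
spark-submit --class com.example.App \
  target/scala-2.12/app-assembly-0.1.0.jar
```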

After examining the different proposed methods, we found that the best way to distribute the jar files is to copy them to all nodes (as @egor also mentioned). Based on our deployment tools, we released a new Spark Ansible role that supports external jars. At the moment, Apache Spark does not provide any optimized solution: if you give a remote URL (HTTP, HTTPS, HDFS, FTP) as an external jar path, the workers fetch the jar file every time a new job is submitted, so it is not an optimal solution from a network perspective.
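The copy-to-all-nodes approach can be sketched as below (the host list, paths, and file names are hypothetical; the answer's Ansible role automates this same step):

```shell
# Pre-distribute the jar to every node once, outside of job submission.
# workers.txt is a hypothetical file listing one worker hostname per line.
while read -r host; do
  scp thirdparty.jar "$host:/opt/spark/libs/"
done < workers.txt

# local: tells Spark the jar already exists on each node, so nothing
# is fetched over the network when new jobs are submitted.
spark-submit --jars local:/opt/spark/libs/thirdparty.jar --class com.example.App app.jar
```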

mrr