I have noticed that in my project there are two ways of running Spark jobs.

  1. The first way is submitting the job with the `spark-submit` script:

    ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master local[8] \
    /path/to/examples.jar \
    100

  2. The second way is to package the Java file into a jar and run it via hadoop, with the Spark code inside MainClassName:

    hadoop jar JarFile.jar MainClassName

What is the difference between these two ways? Which prerequisites do I need in order to use either?

MiamiBeach
  • I believe it's the kind of JARs that will be added to the classpath. `hadoop jar` will only add Hadoop-related JARs to the classpath while executing the JAR, whereas `spark-submit` will add spark-core and spark-sql as well as the Hadoop-related JARs. – philantrovert Nov 18 '20 at 15:55
  • I doubt `hadoop jar` is correct for Spark. There's no way to pass executor parameters, for example; also, you shouldn't call `setMaster` manually in the code, so it wouldn't know to run on YARN – OneCricketeer Nov 18 '20 at 16:14
  • philantrovert, it looks like the `hadoop jar` command executes the jar on Hadoop: https://stackoverflow.com/questions/13012511/how-to-run-a-jar-file-in-hadoop. The question is how this is parallelized if it is not a MapReduce jar. – MiamiBeach Nov 18 '20 at 18:49
  • Well, it shouldn't run at all because `hadoop jar` isn't putting Spark libraries into the classpath. Nor should your uber jar contain spark-core – OneCricketeer Nov 19 '20 at 15:32
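
To illustrate the classpath point from the comments above, here is a minimal sketch of what each launcher puts on the classpath; the `/opt/spark/jars` path is an assumption about where a Spark distribution might live, not something from the question:

    # `hadoop jar` starts a plain JVM with only the Hadoop jars (plus whatever is in
    # HADOOP_CLASSPATH) on the classpath, so the Spark classes have to come from an
    # uber jar or be exported by hand (assumed path below):
    export HADOOP_CLASSPATH="/opt/spark/jars/*"
    hadoop jar JarFile.jar MainClassName

    # `spark-submit` already ships spark-core, spark-sql and the Hadoop client jars,
    # so the application jar only needs the job code itself:
    ./bin/spark-submit --class MainClassName --master "local[8]" JarFile.jar

Even with the classpath patched up like that, `hadoop jar` still just runs the main class in a single local JVM and has no flags for executor resources, which is why the comments question whether it is a good fit for Spark at all.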

1 Answer

As you stated, the second way of running a Spark job, packaging a Java file with Spark classes and/or syntax, essentially wraps your Spark job within a Hadoop job. This has its disadvantages: mainly that your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, but also some growing pains around compatibility between the two frameworks' versions. In that case, the developer has to be careful about the setup the job will run on across two different platforms, even if it seems a bit simpler for Hadoop users who have a better grasp of Java and the Map/Reduce/Driver layout than of the more pre-tuned nature of Spark and the steep-but-convenient learning curve of Scala.

The first way of submitting a job is the most "standard" one (at least judging by how often it appears online, so take that with a grain of salt), running the job almost entirely within Spark (unless, of course, you store the output of your job in, or take its input from, HDFS). This way you are only somewhat dependent on Spark, keeping the quirks of Hadoop (i.e. its YARN resource management) away from your job. It can also be significantly faster in execution time, since it's the most direct approach.
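
For concreteness, a minimal sketch of what the first way usually looks like against a YARN cluster; the executor sizing and paths are placeholders rather than values from the question:

    # The main class should not hard-code setMaster(), so that --master below decides
    # where the job runs; the resource numbers are just illustrative.
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 2g \
      --executor-cores 2 \
      /path/to/examples.jar \
      100

Roughly speaking, the prerequisites for this route are a Spark distribution on the submitting machine and, for `--master yarn`, a `HADOOP_CONF_DIR`/`YARN_CONF_DIR` pointing at the cluster's client-side configuration.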

Coursal
  • Thank you, Coursal! So the 2nd way is an absolutely valid one? I wonder how Hadoop handles non-MapReduce jobs. Will this jar be executed on one node only, which will serve as a driver for Spark? – MiamiBeach Nov 18 '20 at 18:47
  • They are both valid; it's just that the simple `spark-submit` one is the most common and straightforward. As for the handling of non-MR jobs, Hadoop can be extended with the tools/frameworks in its ecosystem (some extensions can be seen here: https://www.analyticsvidhya.com/blog/2020/10/introduction-hadoop-ecosystem/). The command that takes the `.jar` file as a parameter is going to be executed on the master node/driver of the cluster, while the rest of the nodes process the load that the driver determines (with much of this being done entirely internally in Spark). – Coursal Nov 18 '20 at 19:29
  • As an extension to my comment, and in regard to the whole Hadoop ecosystem thing, the most barebones way to have something like that is to only use Hadoop's HDFS and run your jobs/programs on Spark. Yes, you could run Spark commands in Hadoop environments, but it kind of defeats the point, since Spark outdoes almost every single aspect of Hadoop in terms of parallel execution. The main focus should be _in which framework of these two should I develop/design/adapt my job?_ – Coursal Nov 18 '20 at 19:44
  • Coursal, but what do you mean by "Yes, you could run Spark commands in Hadoop environments"? Spark has its own cluster with several types of nodes, and all Spark commands are executed inside that cluster, right? So what do you mean by "run Spark commands in Hadoop environments"? – MiamiBeach Nov 19 '20 at 08:48
  • >> The command that takes the .jar file as a parameter is going to be executed on the master node/driver of the cluster - do you mean the Hadoop cluster or the Spark cluster? – MiamiBeach Nov 19 '20 at 08:50
  • I mean a Spark cluster executing a job on top of Hadoop infrastructure like HDFS (just as if Spark were to take input from and/or store data in cloud services like AWS). As for where the command executes, it's always the master node, which will be the same node for both Hadoop and Spark. – Coursal Nov 19 '20 at 11:24
  • It's not a master "node", though. For Spark on YARN, an ApplicationMaster is picked at random from the pool of NodeManagers – OneCricketeer Nov 19 '20 at 15:35
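
To make that last point concrete, here is a sketch of the two YARN deploy modes, reusing the jar and class names from the question as placeholders:

    # client mode (the default): the driver runs inside the spark-submit process on the
    # machine you launch from; YARN only hosts the executors.
    ./bin/spark-submit --master yarn --deploy-mode client --class MainClassName JarFile.jar

    # cluster mode: the driver runs inside the YARN ApplicationMaster, which YARN places
    # on one of the NodeManagers, so there is no fixed "master node" for it.
    ./bin/spark-submit --master yarn --deploy-mode cluster --class MainClassName JarFile.jar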