
I finished some code in a Spark notebook and tried to move it into a real project: I used sbt to build a jar, then ran it with spark-submit.

Problem: It takes just 10 minutes to get the result in the Spark notebook, but almost 3 hours when I run the same job with spark-submit.

Info: The Spark and Scala versions are the same, and the configuration parameters (master URL, executor cores/memory, etc.) are identical between the notebook and spark-submit.
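For reference, the submission looks roughly like this (the master URL, resource values, class name, and jar path below are illustrative placeholders, not my real ones):

```shell
# Sketch only: every value here is a placeholder to be matched
# against the notebook session's configuration.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --executor-memory 8g \
  --executor-cores 4 \
  --class com.example.MyJob \
  target/scala-2.11/myjob-assembly-0.1.jar
```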

Suspect 1: Maybe it's the logging (LogFactory.getLog().info("xxxx"))? Could printing or saving the log messages make the program take too long?
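One cheap way to rule this suspect in or out would be to lower the log level instead of deleting the calls; assuming the default log4j setup that ships with Spark, a fragment in conf/log4j.properties along these lines would suppress INFO output:

```
# Raise the threshold so info-level calls become no-ops
log4j.rootCategory=WARN, console
```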

Suspect 2: Maybe it's the code? I didn't make any big changes to the notebook code; I just created a function, put the code inside it, and ran it. Should I do some repartitioning or something?
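If the spark-submit run ends up with far fewer partitions than the notebook run (for example because spark.default.parallelism differs between the two environments), explicit repartitioning is one thing to try. A hedged sketch, assuming an existing SparkSession `spark` and a DataFrame `df` standing in for my actual data:

```scala
// Sketch only: `df` and the partition count are illustrative.
// First check how the work is currently split:
println(s"partitions = ${df.rdd.getNumPartitions}")

// If that number is much smaller than in the notebook run,
// spread the data across more tasks; 200 is a placeholder,
// to be tuned toward (executors * cores per executor):
val repartitioned = df.repartition(200)
```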

Charlie Brumbaugh
Leyla Lee
Here's a way to test what's going on: Open a notebook with a fresh Spark session (presumably this takes ~1 minute), and check your YARN or Spark UI to make sure nothing has run yet. Run the code in the notebook and look at the resources used in the UI (# of tasks, # of executors, cache usage, time taken per task, etc.). Then run the same code through the spark-submit jar and see which stages are taking longer. Do they have the same # of tasks, did they have the same DAG execution plan, did they take the same amount of time per task, etc.? – Garren S Oct 18 '17 at 19:57
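Following up on that suggestion, the two sessions can also be compared programmatically rather than only in the UI; a sketch, assuming a SparkSession `spark` is available in both environments so the printed output can be diffed:

```scala
// Dump the settings that most often differ between a notebook
// session and a spark-submit run:
println(spark.sparkContext.getConf.toDebugString)
println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")
println(s"shuffle partitions = ${spark.conf.get("spark.sql.shuffle.partitions")}")
```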

0 Answers