
I tried a simple example on Spark 2.1 (Cloudera):

val flightData2015 = spark
  .read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("/2015-summary.csv")

but when I checked the Spark shell UI, I found it generated three jobs.

I think every action should correspond to one job — am I right? I did some experimenting and found that every option can generate a job. Does option act like an action? Please help me understand this situation.
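For what it's worth, a hedged sketch of how to avoid the extra jobs: `inferSchema` forces Spark to scan the file (and reading the header can trigger another pass), so supplying an explicit schema should reduce the read to a single job. The column names below are assumptions based on the usual `2015-summary.csv` flight-data file — adjust them to your actual header:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Assumed columns; replace with the real ones from your CSV header.
val flightSchema = StructType(Seq(
  StructField("DEST_COUNTRY_NAME", StringType),
  StructField("ORIGIN_COUNTRY_NAME", StringType),
  StructField("count", LongType)
))

val flightData2015 = spark
  .read
  .schema(flightSchema)       // no schema-inference scan over the file
  .option("header", "true")   // header row is skipped, not inferred
  .csv("/2015-summary.csv")
```

With the schema given up front, no job should run until you call an actual action such as `count()` or `show()`.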

yuxh
    [Why does SparkSession execute twice for one action?](https://stackoverflow.com/q/38924623/10465355) – 10465355 Dec 21 '18 at 13:08

1 Answer


@yuxh, it's because of defaultMinPartitions, which has been set to 3. It reflects the parallelism when a Spark job is executed. You can change it globally in yarn-site.xml, or dynamically for a specific job by issuing sqlContext.setConf("spark.sql.shuffle.partitions", "your value")
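If you want to try the suggestion above, a minimal sketch (the value `10` is purely illustrative, and `spark.conf.set` is the Spark 2.x equivalent of the `sqlContext.setConf` call mentioned):

```scala
// Per-session: takes effect for subsequent shuffles in this SparkSession
spark.conf.set("spark.sql.shuffle.partitions", "10")

// Or at submit time:
// spark-submit --conf spark.sql.shuffle.partitions=10 ...
```

Note that this setting controls the number of partitions used for shuffles, which is separate from how many jobs a `read` triggers.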

Subash
  • I don't think so. I can reduce the number of jobs by deleting the option function, and the jobs have nothing to do with parallelism – yuxh Dec 21 '18 at 11:51
  • What was the reason for the downvote? Did you try applying the settings and starting the Spark job? – Subash Dec 21 '18 at 16:28