My Code:

scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)

scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23

scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25

scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27

scala> keyvalueRDD.count
res12: Long = 4

As you can see above, three RDDs are created.

My question is: when does the DAG get created, and what does a DAG contain?

Does it get created when we create an RDD using any transformation?

or

Does it get created when we call an action on an existing RDD, after which Spark automatically launches that DAG?

Basically, I want to know what happens internally when an RDD gets created.

Surender Raja
  • Possible duplicate of [Where to learn how DAG works under the covers in RDD?](http://stackoverflow.com/questions/25836316/where-to-learn-how-dag-works-under-the-covers-in-rdd) – Jacek Laskowski Dec 29 '16 at 16:33

1 Answer

  • The DAG is created when a job is executed (when you call an action) and it contains all required dependencies to distributed tasks.

  • The DAG itself is not executed. Based on the DAG, Spark determines the tasks, which are then distributed to the workers and executed there.

  • An RDD alone defines its lineage, which can be recovered by recursively traversing its dependencies.
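
The last point can be checked directly. Below is a minimal sketch, assuming the same data as in the question and a session where Spark is on the classpath (for example `spark-shell`). `lineage` is a hypothetical helper written for this illustration and is not part of Spark; `dependencies`, `toDebugString` and `count` are the regular RDD methods.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Reuses the active session inside spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
val sc = spark.sparkContext

// Same pipeline as in the question: transformations only, nothing runs yet.
val recordsRDD = sc.parallelize(List("CHN|2", "CHN|3", "BNG|2", "BNG|65"))
val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
val keyvalueRDD = mapRDD.map(elem => (elem(0), elem(1)))

// Hypothetical helper (not a Spark API): walk the dependencies recursively,
// starting at the given RDD and ending at the RDDs that have no parents.
def lineage(rdd: RDD[_]): List[RDD[_]] =
  rdd :: rdd.dependencies.toList.flatMap(dep => lineage(dep.rdd))

// The lineage can be inspected before any action has been called.
val chain = lineage(keyvalueRDD)
chain.foreach(r => println(s"${r.getClass.getSimpleName}[${r.id}]"))
println(keyvalueRDD.toDebugString)

// Only the action submits a job to the scheduler.
val n = keyvalueRDD.count()
```

Here `chain` holds `keyvalueRDD`, `mapRDD` and `recordsRDD` in that order, which shows that the lineage exists as soon as the transformations are declared; tasks only appear once `count()` is called.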

user7337271
  • "it contains all required dependencies to distributed tasks." -- given the 2nd item, I think you consider a DAG == an RDD lineage (which is correct and is a DAG of RDD dependencies). A DAG of stages is different and is created by DAGScheduler. Only in DAGScheduler is the first DAG transformed into the other, and only then do tasks show up. – Jacek Laskowski Dec 29 '16 at 16:38
  • So you synonymize DAG and lineage here? – user7337271 Dec 29 '16 at 16:41
  • It depends on the stage of a Spark job's processing (pun unintended). RDD lineage _is_ indeed a DAG of RDD dependencies. – Jacek Laskowski Dec 31 '16 at 16:00
  • @JacekLaskowski Difference could be that lineage is made up of only transformations whereas DAG contains both transformations and actions. Correct me if I am wrong. – Anand Mar 04 '20 at 17:34
  • A DAG is only transformations and an action executes it. – Jacek Laskowski Mar 04 '20 at 17:56