10

In the Spark Web UI there are two DAG visualizations, one for the Job (screenshot)

and one for the Stage (screenshot),

as explained here. The blog post explains the green dots in the Job's DAG, but it says nothing about the green-shaded boxes in the Stage's DAG. Could someone please give a hint?

Update: If the green shading also means that the indicated code is where data is cached, what can we do to improve performance?

FuzzY

1 Answer

6

The link you provided mentions this:

Second, one of the RDDs is cached in the first stage (denoted by the green highlight)

So the green boxes mark RDDs that are cached: future references to those RDDs read the cached data instead of regenerating it from scratch.
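To make the point concrete, here is a minimal sketch of the kind of code that should produce a green-highlighted RDD in the Stage DAG. It is not the asker's code: the input data, the object name `CachedRddDagSketch`, and the `local[*]` master are all assumptions made for illustration, and the exact rendering may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example, not taken from the question.
object CachedRddDagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-rdd-dag-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up input: (key, value) pairs with 10 distinct keys.
    val pairs = sc.parallelize(1 to 100000).map(n => (n % 10, n))

    // groupByKey needs a shuffle, so it starts a new stage.
    val sums = pairs.groupByKey().mapValues(_.sum)

    // Mark the RDD for caching. Nothing is stored yet;
    // caching is lazy and happens when an action runs.
    sums.cache()

    // First action: computes the RDD and stores its partitions.
    println(sums.count())

    // Second action: reads the cached partitions instead of
    // recomputing the shuffle and the map above.
    sums.collect().foreach(println)

    spark.stop()
  }
}
```

If you run something like this and open the second job in the Web UI, the RDD on which `cache()` was called should be the one highlighted in green. The Storage tab lists the same RDD, which is a quick way to cross-check what the DAG is showing.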

Ramesh Maharjan
  • My understanding is that caching is about data, not about computing stages. Why is Stage 16 not greyed out if it contains 2 cached queries? – Jacek Laskowski Jul 05 '17 at 04:06
  • 1
    You know better, @JacekLaskowski, so I am not going to argue with you, as I learned Spark by reading your book. But I would say that maybe the code is designed to cache after groupBy. – Ramesh Maharjan Jul 05 '17 at 04:33
  • Thanks for the kind words. Can we however focus on discussing the topic at hand? So, what do you think is the reason for the stage not being greyed out even though it contains cached RDDs? – Jacek Laskowski Jul 05 '17 at 09:44
  • The stage before groupBy is greyed out, as you can see. – Ramesh Maharjan Jul 05 '17 at 09:48
  • Why? Are you saying that the stages before cached RDDs should always be greyed out? That's not consistent with the screenshot though, is it? – Jacek Laskowski Jul 05 '17 at 09:52
  • That is how it looks in the screenshot. If the RDDs are cached and the parent RDDs are not used by any other functions, then they are greyed out, aren't they? – Ramesh Maharjan Jul 05 '17 at 09:58
  • I don't think it's a duplicate, as this question asks for an explanation of the green boxes and the other one is about shaded boxes. Could you explain the downvote on my answer? – Ramesh Maharjan Jul 07 '17 at 07:19