For simplicity's sake, and in pseudocode: if I do myDF = spark.read.option('inferSchema', True).json(someFiles)
and then I do myDF.count(),
does Spark read the data from disk twice?

Joshua Cook
1 Answer
- If the DAG contains only narrow transformations, Spark reads the data from the source again on each action.
- In your case it actually reads the data twice even though there is only one action: once when the reader scans the files to infer the schema, and again when count() runs. See "Why does SparkSession execute twice for one action?"
- If the DAG contains wide transformations, Spark can reuse shuffle files in some cases (see "What does "Stage Skipped" mean in Apache Spark web UI?"), but that won't happen here.
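
If the repeated reads matter, one possible way around them is sketched below. It assumes you know the JSON fields in advance: passing an explicit schema means the reader has nothing to infer, so the extra scan should go away, and cache() lets later actions reuse the data once the first action has materialized it. The helper name read_json_with_schema and the example schema and path are made up for illustration; they are not from the question.

```python
def read_json_with_schema(spark, paths, ddl_schema):
    """Read JSON with a user-supplied schema and mark the result for caching.

    spark      -- an existing SparkSession (assumed to be created elsewhere)
    paths      -- a path or list of paths to JSON files
    ddl_schema -- a DDL-style schema string, e.g. "id INT, name STRING"
                  (hypothetical fields; replace with your own)
    """
    # An explicit schema means Spark does not need to scan the files
    # up front to work out the column types.
    df = spark.read.schema(ddl_schema).json(paths)

    # cache() is lazy: nothing is read here. The first action reads the
    # files and stores the result; later actions can reuse it if it
    # still fits in the cache.
    return df.cache()
```

With a real session this would be called as, say, myDF = read_json_with_schema(spark, "data/*.json", "id INT, name STRING") followed by myDF.count(). The first count() still reads the files once; whether later actions avoid the disk depends on the cached data not having been evicted.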