I have a 30 TB file in HDFS, and I am reading that file in Spark. After reading the file, where will that data be stored? Suppose:
val customerDF = spark.read.format("csv").load("/path/to/file.csv")
Where will customerDF be stored?
It won't be stored anywhere until you need to process it; this is called lazy evaluation. Spark builds a graph (a DAG) of all the transformations it needs to perform, and only when you persist the DataFrame or run an action on it will the data be loaded into memory and processed.
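For illustration, a minimal sketch of that behavior (the path and the filter column are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-demo").getOrCreate()
import spark.implicits._

// Transformations: these only build up the DAG; no data is read yet.
val customerDF = spark.read.format("csv").load("/path/to/file.csv")
val nonEmpty = customerDF.filter($"_c0".isNotNull) // _c0: default CSV column name

// Action: only now is the file actually scanned and the plan executed.
nonEmpty.count()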
You also have the persist method on a DataFrame to make it persistent; there you can select a different StorageLevel:
import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
More info about storage levels here: RDD Persistence (https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)
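Note that persist is itself lazy: nothing is cached until the first action runs. A sketch, with df standing in for any DataFrame:

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilling to disk
df.count()                                   // the first action populates the cache
df.count()                                   // served from the cache, not the source file
df.unpersist()                               // release the cached blocks when done

Other common levels are MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY; a DataFrame can hold only one storage level at a time.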
Based on your example, the file will not be read yet and nothing is stored anywhere at that point in time. Spark is lazy: it only reads things when an action such as write, count, or collect is called. If you do not use any sort of caching of the DataFrame (via cache or persist), then what will be read from the file, and how much of it, depends on the subsequent operations that cause projections: select, groupBy, join, etc. If you use shuffle operations (groupBy, window functions, joins), then the projected data will be written to temporary folders on the worker/data nodes to facilitate communication between the stages.
Example:
val customerDF = spark.read.format("").load("/path") //Files are not read yet
val customerStats = customerDF.groupBy("customer_id").count() //Files are not read yet
customerStats.show(100, false) //Action: the files are read now
In the above example the files are read only on the show command. Only the customer_id column is extracted (column pruning), and because of the count, stage 1 stores partial counts into the directories configured by SPARK_LOCAL_DIRS and sends them to stage 2, which does the final rollup and displays 100 lines on the screen.
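You can see both effects without running the 30 TB job by inspecting the physical plan; a sketch, assuming the same customerStats as above:

// Prints the physical plan without executing it. Expect to see:
//  - a FileScan whose ReadSchema contains only customer_id (column pruning)
//  - an Exchange node, i.e. the shuffle whose intermediate files land in
//    the directories configured by spark.local.dir / SPARK_LOCAL_DIRS
customerStats.explain()

If you plan to run several actions on customerStats, caching it (customerStats.cache()) avoids re-reading the source file each time.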