
I am new to Spark and am learning about DataFrames, their operations, and the architecture. While reading a comparison of RDDs and DataFrames, I got confused about the data structure behind each of them. Below are my observations; please clarify or correct them if they are wrong.

1) An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is on a cluster (e.g. HDFS).

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (e.g. a laptop). Am I right?

2) Is there any relationship between a block and a partition? Which one is the superset?

3) DataFrame: is a DataFrame stored in the same way as an RDD? Will an RDD be created in the background if I load my source data into a DataFrame only?

Thanks in advance :)

NikRED
You can find a lot of reading material online for Spark. Even the Apache Spark documentation is quite detailed. In addition to that, you could also refer to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/ – hagarwal Sep 12 '19 at 09:23

1 Answer


An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is on a cluster (e.g. HDFS).

Not only in RAM. If caching or checkpointing is enabled, the data may be stored either in memory or on disk. Also, shuffling always involves writing to disk.

If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (e.g. a laptop). Am I right?

The CSV file will be split into multiple partitions, and each task will read only its own chunk of the data (a range between a start and an end offset).
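To make the start/end offsets concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's or Hadoop's actual reader; `compute_splits` and `read_split` are hypothetical helpers, and the rule that a line belongs to the range in which its first byte falls is an assumption chosen for the illustration.

```python
import os
import tempfile


def compute_splits(file_size, split_size):
    """Return (start, end) byte ranges that together cover the whole file."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Return the lines whose first byte falls inside [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume through the next newline, so a
            # line that began in the previous range is left to that reader.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


# Demo: a tiny CSV file cut into 12-byte ranges (an arbitrary size).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as tmp:
    tmp.write("id,name\n1,alice\n2,bob\n3,carol\n")
    demo_path = tmp.name

for start, end in compute_splits(os.path.getsize(demo_path), 12):
    print((start, end), read_split(demo_path, start, end))

os.remove(demo_path)
```

Each reader touches only its own byte range plus the tail of a line that crosses the boundary, so every line is read exactly once even though the cut points ignore line breaks.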

Is there any relationship between a block and a partition? Which one is the superset?

It is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data, while a block is a physical division of the data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop.

When a file is read from HDFS, HadoopRDD is used, and under the hood each split becomes a partition.
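As a rough illustration of why neither one is simply a superset of the other, the sketch below counts splits for a file of a given size. It is loosely modeled on the `max(minSize, min(goalSize, blockSize))` rule in Hadoop's old `FileInputFormat` and leaves out details such as the slop factor and multiple input files; the function names and the example sizes are assumptions made for the illustration.

```python
import math

MB = 1024 * 1024


def split_size(total_size, block_size, min_partitions, min_size=1):
    """Simplified split size: aim for total/min_partitions, capped by the block size."""
    goal_size = total_size // max(min_partitions, 1)
    return max(min_size, min(goal_size, block_size))


def count_splits(total_size, block_size, min_partitions):
    """Number of logical splits (partitions) under the simplified rule."""
    if total_size == 0:
        return 0
    size = split_size(total_size, block_size, min_partitions)
    return math.ceil(total_size / size)


# A hypothetical 300 MB file stored in 128 MB blocks.
total, block = 300 * MB, 128 * MB
print("physical blocks:", math.ceil(total / block))
for requested in (2, 10):
    print("requested", requested, "-> splits:", count_splits(total, block, requested))
```

With a small requested partition count the splits line up with the blocks, while a larger count produces several splits per block. The block layout stays the same either way, which is the sense in which a block is physical and a split is logical.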

DataFrame: is a DataFrame stored in the same way as an RDD? Will an RDD be created in the background if I load my source data into a DataFrame only?

Yes. A DataFrame is nothing more than an RDD[InternalRow] under the hood.
Take a look at SparkPlan, whose execute() method returns an RDD[InternalRow].

Gelerion