
I am having a hard time understanding the difference between RDD partitions and HDFS input splits. So essentially, when you submit a Spark application:

When the Spark application wants to read a file from HDFS, that file will have input splits (say 64 MB each, with each input split stored on a different data node).

Now let's say the Spark application loads that file from HDFS using sc.textFile(PATH_IN_HDFS). The file is about 256 MB and has 4 input splits, where 2 of the splits are on data node 1 and the other 2 are on data node 2.

Now when Spark loads this 256 MB into its RDD abstraction, will it load each of the input splits (64 MB) into 4 separate RDDs (so you would have 2 RDDs with 64 MB of data on data node 1 and the other two RDDs with 64 MB of data on data node 2)? Or will the RDD further partition those input splits from Hadoop? And how would those partitions be redistributed? I don't understand whether there is a correlation between RDD partitions and HDFS input splits.
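For what it's worth, one quick way to inspect this yourself is to load the file and ask the resulting RDD how many partitions it has. A minimal spark-shell sketch, where the path is hypothetical:

```scala
// Sketch (hypothetical path): inspect how many partitions Spark actually created
// for the file described above.
val rdd = sc.textFile("hdfs:///user/example/file_256mb.txt")
println(rdd.getNumPartitions)   // e.g. 4 if the file has 4 input splits
```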

2 Answers


I'm pretty new to Spark, but splits are strictly a MapReduce concept. Spark loads the data into memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it somewhat depends on where the data blocks are, which is very close to the split idea). Spark's APIs let you think in terms of RDDs and no longer in terms of splits. You work on the RDD; how the data is distributed within the RDD is no longer the programmer's problem. Your whole dataset, under Spark, is called an RDD.
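To illustrate the "think in terms of the RDD" point, here is a minimal word-count sketch (the path is hypothetical); you only express transformations on the RDD, and Spark schedules one task per partition behind the scenes:

```scala
// Minimal word-count sketch (hypothetical path): the programmer only writes
// transformations on the RDD; the mapping from HDFS blocks to partitions and
// tasks is handled by Spark and the cluster manager.
val lines  = sc.textFile("hdfs:///user/example/input.txt")
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.take(10).foreach(println)
```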

ozw1z5rd
  • But then how does partitioning work in an RDD? As you said, "Your whole dataset, under Spark, is called an RDD". So how does the RDD partition the entire data set it gets from the HDFS splits into partitions? –  Oct 08 '16 at 21:11
  • Each node loads its part. Spark talks to YARN, which allocates the requested resources. Data locality is always preferable, but it is not always guaranteed. At this level there are no splits; you are working with data blocks. The file's blocks are loaded into the data nodes that have the containers allocated by YARN; hopefully these are the same nodes that hold the data. – ozw1z5rd Oct 08 '16 at 21:21
  • So Spark will talk to YARN to allocate the requested resources to run the Spark transformations and actions on the specified data set in HDFS. I don't think you are answering my question: when the Spark application is sent to the executors (data nodes in HDFS) and the data from the HDFS input splits is put into the RDD abstraction, does the RDD further partition the data it gets from the HDFS splits? –  Oct 08 '16 at 21:27
  • Your data are in the memory of the data nodes that are running the containers allocated for the Spark job. Each node works on its partition of the data. Perhaps you want to know how the data read from the data nodes is distributed among the computational nodes. Spark runs one task for each partition of the cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster; however, you can also set it manually (see the sketch after these comments). – ozw1z5rd Oct 08 '16 at 21:34
  • Perhaps this will help you: [spark programming guide](https://spark.apache.org/docs/latest/programming-guide.html) – ozw1z5rd Oct 08 '16 at 21:35
  • Thank you very much for the help. I noticed this question answered my concerns: http://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs I didn't look hard enough haha... –  Oct 08 '16 at 22:28
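Regarding setting the partition count manually, as mentioned in the comments above, here is a minimal sketch; the path and the numbers are hypothetical:

```scala
// Sketch (hypothetical path and numbers) of setting the partition count manually.
// A minPartitions hint at read time:
val rdd = sc.textFile("hdfs:///user/example/file.txt", minPartitions = 8)

// Or re-distributing an existing RDD afterwards:
val more  = rdd.repartition(16)   // full shuffle; can increase or decrease
val fewer = rdd.coalesce(2)       // avoids a shuffle when only decreasing
println(more.getNumPartitions)
```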

Hope the answer below helps you.

When Spark reads a file from HDFS, it creates a single partition for a single input split.

If you have a 30GB text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 235 blocks, which means that the RDD you read from this file would have 235 partitions.
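As a rough way to see this relationship yourself, you can count how many lines land in each partition; a minimal sketch, assuming a hypothetical file path:

```scala
// Sketch (hypothetical path): by default there is one RDD partition per HDFS
// input split; count the lines in each partition to see the distribution.
val rdd = sc.textFile("hdfs:///user/example/big_file.txt")
println(s"partitions: ${rdd.getNumPartitions}")

val linesPerPartition = rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
linesPerPartition.foreach { case (idx, n) => println(s"partition $idx -> $n lines") }
```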

chaitra k