Let's say I have a cluster of 4 nodes, each having 1 core. I have a 600 Petabyte file which I want to process through Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions (600 / 4), so each partition will be 150 PB in size.
But I think 150 PB is far too big for a single partition, so is my thinking correct about how to deduce the number of partitions?
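For reference, here is a minimal sketch (the HDFS path and app name are placeholders I made up) of how I would print the number of partitions Spark actually creates when reading the file:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionCheck")          // placeholder app name
      .getOrCreate()

    // Read the file from HDFS as an RDD of lines (placeholder path).
    val rdd = spark.sparkContext.textFile("hdfs:///data/bigfile.txt")

    // Print how many partitions Spark actually created for this input.
    println(s"Number of partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```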
PS: I have just started with Apache Spark, so apologies if this is a naive question.