13

If a file is loaded from HDFS, Spark by default creates one partition per HDFS block. But how does Spark decide the number of partitions when a file is loaded from an S3 bucket?

Suhas Chandramouli

3 Answers

3

Even when reading a file from an S3 bucket, Spark (by default) creates one partition per block, i.e. total number of partitions = total file size / block size.

The block size for S3 is available as a property in Hadoop's core-site.xml, which Spark uses:

<property>
  <name>fs.s3a.block.size</name>
  <value>32M</value>
  <description>Block size to use when reading files using s3a: file system.
  </description>
</property>
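
A minimal sketch of how to verify this, assuming a hypothetical object s3a://my-bucket/data.csv and an illustrative 64 MB block size passed through Spark's spark.hadoop.* prefix (reading s3a:// paths also requires the hadoop-aws module on the classpath):

import org.apache.spark.sql.SparkSession

object S3PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-partition-check")
      // spark.hadoop.* settings are forwarded to the Hadoop Configuration,
      // so this overrides fs.s3a.block.size (64 MB here, purely illustrative).
      .config("spark.hadoop.fs.s3a.block.size", "67108864")
      .getOrCreate()

    // Hypothetical object: with a 64 MB "block size", a ~200 MB object
    // should produce roughly 200 / 64 ≈ 4 input partitions.
    val rdd = spark.sparkContext.textFile("s3a://my-bucket/data.csv")
    println(s"partitions = ${rdd.getNumPartitions}")

    spark.stop()
  }
}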

Unlike HDFS, AWS S3 is not a file system. It is an object store. The S3A connector makes S3 look like a file system.

Please check out the documentation for more details.

2

See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits().

The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). For example, S3AFileStatus may simply set it to 0, in which case FileInputFormat.computeSplitSize() comes into play.

Also, you don't get splits if your InputFormat is not splittable :)
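
For intuition, here is a minimal sketch of that rule: FileInputFormat.computeSplitSize() returns max(minSize, min(goalSize, blockSize)), where goalSize is roughly the total input size divided by the requested number of splits. The concrete numbers below are purely illustrative.

object SplitSizeSketch {
  // Mirrors the formula in org.apache.hadoop.mapred.FileInputFormat.computeSplitSize():
  // the split size is min(goalSize, blockSize), but never below minSize.
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // Illustrative values: 1 GiB of input, 2 requested splits, 32 MiB block size.
    val goalSize  = (1024 * mb) / 2
    val blockSize = 32 * mb
    println(computeSplitSize(goalSize, 1L, blockSize) / mb) // 32 (MB per split)

    // If the reported block size is 0, min(goalSize, 0) is 0,
    // so the result collapses to minSize.
    println(computeSplitSize(goalSize, 1L, 0L)) // 1
  }
}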

Ivan Borisov
0

By default, Spark will create partitions of size 64 MB when reading from S3. So a 100 MB file will be split into 2 partitions of 64 MB and 36 MB, and an object of 64 MB or less won't be split at all.
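
A rough sketch of that arithmetic, assuming the 64 MB figure above:

object PartitionSizes {
  // Splits a file of the given size into chunks of at most partitionMB megabytes.
  def partitionSizes(fileMB: Long, partitionMB: Long = 64): Seq[Long] = {
    val full = (fileMB / partitionMB).toInt
    val rest = fileMB % partitionMB
    Seq.fill(full)(partitionMB) ++ (if (rest > 0) Seq(rest) else Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    println(partitionSizes(100)) // List(64, 36) -> 2 partitions
    println(partitionSizes(64))  // List(64)     -> not split
  }
}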

mightyMouse