
I am unable to figure out how Spark decides on the number of partitions while reading from AWS S3.

My Case:

I am using Spark 1.3 (sorry, but that is not under my control).

My S3 bucket contains CSV files of ~60-75 MB each, organized in batch folders, i.e. folder1, folder2, folder3, etc., each containing 100 CSV files.

I'm getting 295-300 partitions when reading from these folders.

I'm expecting the default to always be around 200 partitions, because if Spark treats S3 data like a block-based filesystem it should split on either 64 MB or 128 MB boundaries; with 100 files of ~60-75 MB each splitting into two pieces at a 64 MB block size, that works out to roughly 200 partitions.
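For reference, a minimal sketch of the kind of read I'm doing and how I check the partition count (the bucket name, path, and the s3n scheme are placeholders, not my real setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-count")
    val sc = new SparkContext(conf)

    // Read all 100 CSV files of one batch folder as plain text.
    val rdd = sc.textFile("s3n://my-bucket/folder1/*.csv")

    // Prints 295-300 here, instead of the ~200 I expected.
    println(s"partitions = ${rdd.partitions.length}")

    sc.stop()
  }
}
```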

Thanks in advance.

Vaibhav KB
    In general [block size and split size belong to different layers](https://stackoverflow.com/q/17727468/10465355). Ideally these would align, in practice it rarely happens (and in latest versions Spark [doesn't even respect Hadoop split configuration](https://stackoverflow.com/q/38249624/10465355)). So the numbers don't look that alarming. – 10465355 Feb 19 '19 at 16:43
  • Can you please explain where this number, i.e. ~300 partitions, comes from? – Vaibhav KB Feb 19 '19 at 17:02
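Following up on the comment about split size vs. block size: a minimal sketch, reusing the same SparkContext `sc` as in the question's sketch, of how a partition count can at least be hinted on Spark 1.3 (the path is a placeholder, and minPartitions is only a lower-bound hint, so the exact count is not guaranteed):

```scala
// minPartitions is passed down to the Hadoop input format as a hint;
// Spark may still produce more partitions than requested.
val hinted = sc.textFile("s3n://my-bucket/folder1/*.csv", 200)
println(s"partitions with hint = ${hinted.partitions.length}")
```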

0 Answers