I am running an experiment to understand which input file size works best with S3 and [EMR + Spark].
Input data:
Incompressible data: Random Bytes in files
Total Data Size: 20GB
Each folder contains files of one fixed size; across folders the file size varies from 2MB to 4GB.
Cluster specifications:
1 master + 4 core nodes: c3.8xlarge
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
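As a sanity check, these values can be read back from inside spark-shell once it starts; a minimal sketch, assuming the shell was launched with the flags above on YARN (where --num-executors maps to spark.executor.instances):
scala> sc.getConf.get("spark.executor.memory")    // expected: 3G
scala> sc.getConf.get("spark.executor.cores")     // expected: 2
scala> sc.getConf.get("spark.executor.instances") // expected: 60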
Code:
scala> def time[R](block: => R): R = {
         val t0 = System.nanoTime()
         val result = block // call-by-name
         val t1 = System.nanoTime()
         println("Elapsed time: " + (t1 - t0) + "ns")
         result
       }
time: [R](block: => R)R
scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};
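Note that sc.textFile is lazy, so the first time{} block above only measures RDD construction; the actual S3 read happens inside the saveAsTextFile action. To time the read path on its own, one option (a sketch, not part of the original run) is to force an action such as count:
scala> val recordCount = time { inputFiles.count() } // full pass over the input, so the elapsed time reflects the S3 read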
Observations
- 2MB - 32MB: Most of the time is spent opening file handles [not efficient].
- 64MB - 1GB: Spark launches 320 tasks for every one of these file sizes, so the task count is no longer tied to the number of files in the bucket. For example, the 512MB folder holds 40 files making up the 20GB and could have been processed in just 40 tasks, yet there were 320 tasks, each dealing with 64MB of data (see the sketch after this list).
- 4GB: 0 bytes output [not able to handle it in memory / data not even splittable???]
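To probe the 320-task behaviour, it helps to look at the partition count Spark computed for the input and at the block size the S3 connector reports to the input format. A rough sketch follows; fs.s3n.block.size is the stock Hadoop property for s3n:// paths (default 64MB), and whether EMR's s3:// connector honours it is an assumption worth verifying:
scala> inputFiles.partitions.length                       // number of input splits = number of read tasks
scala> sc.hadoopConfiguration.get("fs.s3n.block.size")    // block size used when computing splits (assumption: 64MB default)
scala> sc.hadoopConfiguration.setLong("fs.s3n.block.size", 512L * 1024 * 1024) // assumption: raising this before re-reading should yield fewer, larger splits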
Questions
- Is there a default setting that forces the input split size to be 64MB?
- Since the data I am using is random bytes and is effectively already compressed, how is Spark splitting it further? And if it can split this data, why is it not able to split the 4GB files?
- Why does the file size increase after writing back out via Spark? The 2MB compressed input file becomes 3.6MB in the output bucket.
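For the last question, one way to narrow things down is to compare the plain write against one with an explicit compression codec, so any size difference can be attributed to the write format rather than to S3 itself. A sketch, where the -gzip output path is made up for illustration:
scala> import org.apache.hadoop.io.compress.GzipCodec
scala> time { inputFiles.saveAsTextFile("s3://bucket/folder-out-gzip/2mb-10240files-20gb/", classOf[GzipCodec]) }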