I worked on Amazon EMR with Spark. According to this documentation from Amazon (https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/), Amazon EMR does not currently support use of the Apache Hadoop S3A file system, and the s3a:// URI is not compatible with Amazon EMR.
However, I am able to read and write in a Spark job using "s3a://" without issue. (Note: I am using "com.amazonaws" % "aws-java-sdk-s3" % "1.11.286", and my EMR version is emr-5.11.0.) I did some searching but still found myself confused about which file system is currently recommended for use with EMR.
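For reference, this is roughly the kind of job that runs without issue for me in spark-shell (the bucket name and paths below are placeholders, not my real setup):

```scala
// Sketch of an s3a:// read/write that works on my cluster (hypothetical bucket).
// On EMR the instance-profile credentials appear to be picked up automatically,
// so no explicit fs.s3a.access.key / fs.s3a.secret.key settings are needed here.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("s3a://my-bucket/input/fooBar.csv")

df.write
  .format("csv")
  .save("s3a://my-bucket/output/")
```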
Any help would be appreciated.

2 Answers
EDIT: I forgot to state this, but this is written with Spark version 2.3.0 in mind.
Spark on AWS EMR can address S3 through three URI schemes: s3a, s3n, and s3. s3a and s3n are the older Apache Hadoop connectors and can be used both inside and outside of AWS. s3, on EMR, maps to EMRFS, a client built specifically for connecting EMR to S3. From testing I have found that reads and writes are faster using the s3 scheme, and unlike the other schemes you do not need to pass in users/passwords/keys/libraries. The only requirement is that the user (in the VPC or otherwise) has access to the buckets while the Spark shell is active. Below is how you can read and write using s3:
spark-shell
// In this case I am reading a CSV from a bucket called myBucket into the environment
val inputDF = spark.read.format("csv").option("header","true").load("s3://myBucket/fooBar.csv")
// I am then writing that file back out using the s3 scheme
inputDF.write.format("csv").save("s3://myBucket/ODS/")
The biggest issue you will see with reading and writing from Spark is partitioning: however many partitions Spark has on the DataFrame before the write is how many part files it will write out. You might want to look into implementing a repartitioning strategy if you are looking to speed up your reads and writes.
import org.apache.spark.util.SizeEstimator

// estimate the in-memory size of the DataFrame in bytes
val estimatedSize: Long = SizeEstimator.estimate(inputDF.rdd)
// find an appropriate number of partitions, targeting ~128 MB (134217728 bytes) each
val numPartitions: Long = (estimatedSize / 134217728L) + 1
// repartition and write it out with that many partitions
val outputDF = inputDF.repartition(numPartitions.toInt)
outputDF.write.format("csv").save("s3://myBucket/ODS/")

please refer to: https://stackoverflow.com/questions/37077432/how-to-estimate-dataframe-real-size-in-pyspark to use the estimator method in pyspark – enneppi Nov 19 '18 at 11:32
"s3a" is part of Apache Hadoop and is therefore still available on EMR.
The recommended S3 client on EMR is EMRFS, but you can still use either of them: s3a (Apache Hadoop) or s3/s3n (EMRFS). The latter has its own advantages, such as consistent view.
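If you do want to use the Hadoop s3a connector on EMR (for example, to reuse code that also runs outside AWS), it can be configured on the active session. A minimal sketch, assuming the standard Hadoop fs.s3a.* property names and a hypothetical bucket; the environment-variable lookup is only needed off-cluster, since on EMR the instance profile usually supplies credentials:

```scala
// Sketch: pointing the Hadoop S3A connector at explicit credentials.
// These settings are mainly useful outside AWS; bucket name is a placeholder.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = spark.read.format("csv").option("header", "true").load("s3a://my-bucket/fooBar.csv")
```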
