
At our company, we use AWS services for all of our data infrastructure and service needs. Our Hive tables are external tables, and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have an always-on EMR cluster with 1 master node (always running) and 1 core node (always running); whenever data processing happens, additional core nodes and task nodes are added and then removed once processing is done. Our EC2 instances have EBS volumes that provide temporary storage/scratch space for the executors.

Given this context, I am wondering: why do we need HDFS in our EMR cluster at all? I also see that the HDFS NameNode service is always running on the master node, and the DataNode service is running on the core node. They are managing some blocks, but I am not able to find which files those blocks belong to. Also, the total size of all the blocks is very small (~2 GB).

Software versions used

  1. Python version: 3.7.0
  2. PySpark version: 2.4.7
  3. EMR version: 5.32.0

If you know the answer to this question, can you please help me understand this need for HDFS? Please let me know if you have any questions for me.

2 Answers


HDFS in EMR is a built-in component that is provided to store secondary information, such as credentials when your Spark executors need to authenticate themselves to read a resource. Another use is to store log files. In my personal experience, I have used it as a staging area to store a partial result during a long computation, so that if something went wrong in the middle I would have a checkpoint from which to resume execution instead of restarting the computation from scratch. Storing the final result on HDFS is strongly discouraged.
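To make the staging idea concrete, here is a minimal PySpark sketch, assuming a hypothetical S3 bucket and placeholder transformations: an expensive intermediate result is checkpointed to the cluster's HDFS so that a later failure can resume from it instead of recomputing from the raw S3 input, while the final result still goes to S3.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("staged-pipeline").getOrCreate()

    # Expensive first stage reading from S3 (bucket and filter are placeholders)
    raw = spark.read.parquet("s3://my-bucket/input/")
    partial = raw.filter("event_date >= '2023-01-01'")

    # Stage the intermediate result on the cluster's local HDFS
    checkpoint_path = "hdfs:///tmp/pipeline/partial_result"
    partial.write.mode("overwrite").parquet(checkpoint_path)

    # On resume (or in a later step), read the checkpoint instead of recomputing
    partial = spark.read.parquet(checkpoint_path)
    final = partial.groupBy("event_date").count()

    # The final result belongs in S3, not on HDFS
    final.write.mode("overwrite").parquet("s3://my-bucket/output/")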

afjcjsbx
  • Thanks, @afjcjsbx, for the answer. That was my guess as well, but getting confirmation from the public would increase my confidence, which is the reason for this question. I will wait a few more days for other answers before marking yours as the answer. – Venkatesan Muniappan Jan 24 '23 at 08:41
  • And one more thing, @afjcjsbx: do you have any more details about how HDFS can be used for authentication? This is completely new to me. – Venkatesan Muniappan Jan 24 '23 at 08:48
  • I can describe it, since implementing the code here on Stack Overflow would be onerous. In a past project I had to read from an S3 bucket that imposed authentication via an AWS assume role, so I had an AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN that I was using for authentication with Spark. For greater reliability, I stored this information in a temporary folder in HDFS, so that if the executor died I could retrieve these credentials instead of asking the authentication service for new ones (a sketch of this pattern follows below). – afjcjsbx Jan 24 '23 at 10:00
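A rough illustration of the credential-stashing pattern described in the comment above. Everything here is an assumption for illustration: the HDFS path, the placeholder credential values, and the use of PySpark's internal _jvm gateway to reach Hadoop's FileSystem API. Writing plaintext credentials to a shared filesystem also has obvious security trade-offs, so treat this as a sketch of the idea rather than a recommended implementation.

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stash-credentials").getOrCreate()

    # Hypothetical temporary credentials, e.g. from an earlier STS assume-role call
    creds = {
        "AWS_ACCESS_KEY_ID": "AKIA...",     # placeholder
        "AWS_SECRET_ACCESS_KEY": "...",     # placeholder
        "AWS_SESSION_TOKEN": "...",         # placeholder
    }

    # Reach Hadoop's FileSystem API through PySpark's internal JVM gateway
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    path = jvm.org.apache.hadoop.fs.Path("/tmp/creds/session.json")  # hypothetical path

    # Stash the credentials on HDFS (the second argument enables overwrite)
    out = fs.create(path, True)
    out.write(bytearray(json.dumps(creds), "utf-8"))
    out.close()

    # After a restart, read the file back instead of asking the auth service again
    restored = json.loads(
        "\n".join(r.value for r in spark.read.text("/tmp/creds/session.json").collect())
    )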

Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside YARN containers, and Spark distributes its code and configuration to the nodes running those executors by staging them in HDFS. Additionally, the Spark event logs from each running and completed application are stored in HDFS by default.
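If you want to verify this on your own cluster, a small PySpark sketch like the one below lists the event-log files Spark has written to HDFS. The directory shown, /var/log/spark/apps, is the usual EMR default for spark.eventLog.dir, but treat that as an assumption and check your cluster's configuration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inspect-hdfs").getOrCreate()

    # Access Hadoop's FileSystem API through PySpark's internal JVM gateway
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

    # EMR's usual default for spark.eventLog.dir (assumption: verify on your cluster)
    log_dir = jvm.org.apache.hadoop.fs.Path("/var/log/spark/apps")

    # Print each event-log file name and its size in bytes
    for status in fs.listStatus(log_dir):
        print(status.getPath().getName(), status.getLen())

Relatedly, running hdfs fsck / -files -blocks from a shell on the master node will map the NameNode's blocks back to file names, which should answer the question of which files those small blocks belong to.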

Jonathan Kelly