In our place, we use AWS services for all our data infrastructure and services needs. Our hive tables are external tables and the actual data files are stored in S3. We use Apache Spark for Data ingestion and transformation. We have EMR ever-running cluster with 1 master node (always running) and 1 core node(always running), whenever data processing happens additional core nodes and task nodes are added and removed once processing is done. Our EC2 instances are having EBS volumes for temporary storage/scratch space for executors.
Given this context, I am wondering why do we need HDFS in our EMR cluster at all?. I also see that the HDFS Namenode services are always running on the master node and on the core node Datanode services are running. They do have some blocks they are managing but not able to find which files they belong to. And also the size of all the blocks are very small(~2 GB).
Software versions used
- Python version: 3.7.0
- PySpark version: 2.4.7
- Emr version: 5.32.0
If you know the answer to this question, can you please help me understand this need for HDFS?. Please let me know if you have any questions for me.