I'm looking for a way to set up a small remote HDFS cluster for a proof of concept.
There are plenty of guides available for creating an HDFS cluster on AWS EC2 instances, e.g.
Example HDFS guide using AWS EC2 instances
These all satisfy the requirement of being able to write to HDFS from the master/namenode, but I can't find any example of an HDFS setup that supports writing from a remote client. The problem is that the configuration these guides produce results in the namenode handing out the AWS internal private DNS name of a datanode to the external client. As a result I see errors consistent with:
HDFS error: could only be replicated to 0 nodes, instead of 1
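For reference, the write I'm attempting from the remote client is roughly the following (a minimal sketch; the namenode hostname, port, path and user are placeholders for my actual values):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteHdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Public DNS of the EC2 namenode (placeholder) and the namenode RPC port.
            URI namenode = new URI("hdfs://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9000");
            FileSystem fs = FileSystem.get(namenode, conf, "hadoop");
            // The create/write is where the replication error is reported back to the client.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/test.txt"))) {
                out.writeUTF("hello from a remote client");
            }
            fs.close();
        }
    }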
However, I cannot get any of the recommended solutions to work, no matter which permutation of hostname (public DNS, private DNS, or short form), /etc/hosts entries, or hdfs-site.xml properties I try, as suggested by:
Another example of a suggested solution
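Concretely, one permutation of hdfs-site.xml settings I've tried looks like this (these are the standard Hadoop 2.7 hostname-related properties; the values shown are just one of the combinations, applied on the cluster nodes and mirrored on the client):

    <!-- hdfs-site.xml: make datanodes register and be addressed by hostname rather than IP -->
    <property>
      <name>dfs.datanode.use.datanode.hostname</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>

combined with /etc/hosts entries mapping each node's hostname to either its public or private IP, in the various combinations mentioned above.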
The usual test for this issue is to try to download a test file stored on HDFS via the web front end, where the problem is clearly visible: the download URL generated for the file uses the AWS private DNS name.
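The same behaviour can also be checked programmatically rather than through the browser, by looking at the redirect WebHDFS returns for an OPEN call (again a sketch; the hostname and file path are placeholders, and 50070 is the default namenode HTTP port in Hadoop 2.7):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CheckWebHdfsRedirect {
        public static void main(String[] args) throws Exception {
            // Namenode public DNS (placeholder); WebHDFS redirects OPEN requests to a datanode.
            URL url = new URL("http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:50070/webhdfs/v1/tmp/test.txt?op=OPEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            // The Location header shows which datanode address the namenode hands out;
            // in my case it contains the AWS-internal ip-10-x-x-x.ec2.internal private DNS name.
            System.out.println(conn.getResponseCode() + " -> " + conn.getHeaderField("Location"));
            conn.disconnect();
        }
    }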
I'm using Hadoop HDFS version 2.7.1.
Is what I'm trying to achieve actually possible, or should I be looking at a more mature HDFS offering rather than trying to build my own bespoke cluster?