
I was looking for a way to create a small, simple remote HDFS system for a proof of concept.

There are a lot of guides available for creating an HDFS system on AWS EC2 instances, e.g.

Example HDFS guide using AWS EC2 instances

These all satisfy the requirement of being able to write to HDFS from the master/namenode, but I can't find any example of an HDFS setup where the write comes from a remote client. The issue is that the HDFS configuration provided by the guide results in the namenode handing the external client a datanode address based on the AWS internal private DNS name, which the client cannot resolve. As a result I see errors consistent with:

HDFS error: could only be replicated to 0 nodes, instead of 1

However, I cannot get any of the recommended solutions to work, no matter which permutation of hostname (public DNS, private DNS or short form), /etc/hosts entries, or hdfs-site.xml properties I try (a sketch of one such permutation follows the links below), as suggested by:

Example of suggested solution

Another example of suggested solution
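
For reference, one of the hdfs-site.xml permutations I have been trying on the namenode/datanodes looks roughly like this. The two property names come from the stock hdfs-default.xml for 2.7.x; the values shown are just one permutation I tried, not a known-working configuration:

```xml
<!-- hdfs-site.xml (cluster side) - a sketch of one permutation, not a known-working config -->
<configuration>
  <!-- Datanodes use hostnames, rather than raw (private) IPs, when talking to other datanodes -->
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <!-- Clients connect to datanodes using the hostname the namenode supplies, not the private IP -->
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
```

The idea, as I understand the suggested solutions, is that the remote client then receives a hostname it can resolve to the instance's public IP via its own /etc/hosts or DNS, instead of an unreachable private address.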

The usual test for this issue is to try to download a test file stored on HDFS via the web front end, where the problem can be seen clearly: the download URL generated for the file points at the AWS internal private DNS name.

I've been using Hadoop HDFS version 2.7.1.

Is what I am trying to achieve actually possible, or should I be looking at a more mature HDFS offering rather than trying to build my own bespoke one?

Huw
  • Is there a reason you're not using EMR or Hortonworks/Cloudera or some pre-made Terraform/CloudFormation template? – OneCricketeer Jul 15 '18 at 20:32
  • I think I was trying to start with the simplest example I could find as a learning exercise and make use of the free tier where possible. I thought that EMR might be a bit excessive to begin with. Hortonworks/Cloudera were definitely solutions I came across when investigating this issue, although I have no feel for how simple or complicated they are to deploy, and I don't know how demanding they are of resources. I've yet to come across any cloud provisioning tool templates though, although I may have been looking in the wrong place? – Huw Jul 16 '18 at 08:21
  • If you're looking to do Hadoop for free and have at least 8 GB of RAM, I'd suggest using a local VM, not a cloud provider. I haven't personally tried installing either HDP or CDH on that tier. Hadoop overall requires far more resources than the free tier provides, so it's not useful for anything other than practicing installation, which again can be done just as easily in a VM (probably more easily, since you're not dealing with firewalls and security groups) – OneCricketeer Jul 16 '18 at 12:18
  • Thanks for this. I did think about running a VM locally but then thought it would be better to have a remote filesystem for the proof of concept I'm working on. My data files aren't very large at all for the PoC so I assumed I could get away with free tier spec for the HDFS part. I'm going to take a look at HDP to try and understand what the minimum configuration is - although I'm still not sure whether this will get around the existing issue with the private DNS... – Huw Jul 16 '18 at 15:28
  • If you want to use HDP, then [try using Cloudbreak](https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.7.1/index.html). If your files are not large, then use S3 for cheap storage and run Spark from your local computer. HDFS does not do well with small files. Regarding DNS, I think this might lead in the right direction: https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/ – OneCricketeer Jul 16 '18 at 16:36
  • Thanks again for the response. I think the data sets are only small for the PoC and will be bigger in production, hence why I'm persevering with HDFS. I'll take a look at Cloudbreak, thanks. Interestingly, I also found the blog post you mention. Where it suggests _you have to add the following block to your hdfs-site.xml on the client side:_ `dfs.client.use.datanode.hostname true`, what isn't clear to me is what "client side" means in this context. Should all HDFS clients have this file, or does it mean the namenode? – Huw Jul 17 '18 at 08:47
  • I just tried modifying the hdfs-site.xml on the namenode and datanodes and updating the hostnames of all of these but still no joy unfortunately... – Huw Jul 17 '18 at 10:34
  • An HDFS client is every process of HDFS: namenodes, datanodes, normal clients, the secondary namenode, etc. – OneCricketeer Jul 17 '18 at 13:39
  • Update: I tried Cloudbreak but came up against the same issue, so I'm now looking at how to modify my client correctly to incorporate the use of datanode hostnames (a sketch of the client-side hdfs-site.xml I'm now trying is below). – Huw Jul 20 '18 at 09:38
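
For completeness, this is roughly the client-side hdfs-site.xml I am now trying on the machine running the remote client, following the blog post linked in the comments above. Whether it also needs to be mirrored on the cluster nodes is exactly the part I'm unsure about:

```xml
<!-- hdfs-site.xml on the remote client machine - sketch of the setting from the linked blog post -->
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
    <!-- Connect to datanodes using the hostname the namenode returns, so the client's
         /etc/hosts (or public DNS) can map it to the datanode's public IP -->
  </property>
</configuration>
```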

0 Answers