
I'm using Hadoop 2.6, and I have a cluster of virtual machines where I installed my HDFS. I'm trying to remotely read a file from my HDFS through some Java code running on my local machine, in the basic way, with a BufferedReader:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = null;
    String hadoopLocalPath = "/path/to/my/hadoop/local/folder/etc/hadoop";
    Configuration hConf = new Configuration();
    // Load the cluster configuration copied from the Hadoop installation
    hConf.addResource(new Path(hadoopLocalPath + File.separator + "core-site.xml"));
    hConf.addResource(new Path(hadoopLocalPath + File.separator + "hdfs-site.xml"));
    try {
        fs = FileSystem.get(URI.create("hdfs://10.0.0.1:54310/"), hConf);
    } catch (IOException e1) {
        e1.printStackTrace();
        System.exit(-1);
    }
    Path startPath = new Path("/user/myuser/path/to/my/file.txt");

    FileStatus[] fileStatus;
    try {
        fileStatus = fs.listStatus(startPath);
        Path[] paths = FileUtil.stat2Paths(fileStatus);

        for (Path path : paths) {
            // Open the file on HDFS and print it line by line
            BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
            br.close();
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

The program can access the HDFS correctly (no exceptions are raised). If I list files and directories via code, it reads them without problems.

Now, the issue is that if I try to read a file (as in the code shown), it hangs while reading (in the while loop) until it raises a BlockMissingException:

    org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2005327120-10.1.1.55-1467731650291:blk_1073741836_1015 file=/user/myuser/path/to/my/file.txt
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:888)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:568)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at uk.ou.kmi.med.datoolkit.tests.access.HDFSAccessTest.main(HDFSAccessTest.java:55)

What I already know:

  • I tried the same code directly on the machine running the namenode, and it works perfectly
  • I already checked the log of the namenode, and added the user of my local machine to the group managing the HDFS (as suggested by this thread, and other related threads)
  • There should not be issues with fully-qualified domain names, as suggested by this thread, because I'm using static IPs. On the other hand, the scenario "Your cluster runs in a VM and its virtualized network access to the client is blocked" could apply. I would say, though, that if that were the case it wouldn't let me perform any action on the HDFS at all (see the next point)
  • The cluster runs on a network with a firewall, and I have correctly opened and forwarded port 54310 (I can access the HDFS for other purposes, such as creating files and directories and listing their content). I wonder whether other ports need to be opened for file reading (see the sketch after this list)
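
To see which DataNode addresses (and therefore which ports) the client actually has to reach for reads, one option is to ask the NameNode for the block locations of the file. A minimal sketch, reusing the `fs` and `startPath` objects from the code above (each entry returned by `getNames()` is a `host:port` pair of a DataNode serving that block):

    // Additional imports needed: org.apache.hadoop.fs.BlockLocation, java.util.Arrays
    for (FileStatus status : fs.listStatus(startPath)) {
        // Ask the NameNode where each block of the file lives; these are the
        // DataNode addresses the client must be able to connect to for reads.
        BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation location : locations) {
            System.out.println(status.getPath() + " offset " + location.getOffset()
                    + " -> " + Arrays.toString(location.getNames()));
        }
    }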

1 Answer


Can you make sure that the DataNodes are also accessible from the client? I had a similar issue when connecting to Hadoop configured in AWS. I was able to resolve it by confirming the connection between all DataNodes and my client system.
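
One way to verify this from the client machine (a suggested check, not part of the answer itself) is to take the DataNode addresses reported for the file's blocks and attempt a plain TCP connection to each of them; if any of those connections fails, NameNode operations such as listing still work while reads hang and eventually fail. A rough sketch, assuming the same `fs` and `startPath` as in the question:

    // Additional imports needed: java.net.InetSocketAddress, java.net.Socket,
    // org.apache.hadoop.fs.BlockLocation
    for (FileStatus status : fs.listStatus(startPath)) {
        for (BlockLocation location : fs.getFileBlockLocations(status, 0, status.getLen())) {
            for (String name : location.getNames()) {   // each entry is "host:port"
                String host = name.split(":")[0];
                int port = Integer.parseInt(name.split(":")[1]);
                try (Socket socket = new Socket()) {
                    // Fail fast if the DataNode port is filtered or not forwarded
                    socket.connect(new InetSocketAddress(host, port), 5000);
                    System.out.println(name + " is reachable from the client");
                } catch (IOException e) {
                    System.out.println(name + " is NOT reachable: " + e.getMessage());
                }
            }
        }
    }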

  • Probably this is the problem: the network to which all the VMs are connected is managed by an access point that regulates the accesses. Of course, it applies restrictions on the access with a firewall and port-forwarding mechanism that forbids my client to access the Datanodes. The Namenode is accessible only because I opened and forwarded the port 54310 to the Namenode. I can connect to the network only through the access point. Now I wonder how can I make the Datanodes accessible. – McKracken Aug 03 '16 at 14:22
  • You can try to set dfs.datanode.address in hdfs-default.xml and try to do an SSH forward from the client. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml – sterin jacob Aug 03 '16 at 14:39
  • Ok, I think I have to change the `hdfs-site.xml` of my datanodes, specifying the options `dfs.datanodes.address`, `dfs.datanodes.ipc.address`, `dfs.datanodes.http.address`, `dfs.datanodes.https.address`, setting a different port for every datanode, and then forwarding those ports on my access point. Is that correct? – McKracken Aug 03 '16 at 14:40
  • Sorry, it was concurrent commenting – McKracken Aug 03 '16 at 14:40
  • So, I have two datanodes, `dn1` and `dn2`. I added a specific port in each `hdfs-site.xml` through the `dfs.datanodes.address` option, that is `0.0.0.0:50011` and `0.0.0.0:50012` respectively. I restarted the HDFS. I opened the public ports 50011 and 50012 on my firewall, and I forwarded public port 50011 to port 50011 of `dn1`, and public port 50012 to port 50012 of `dn2`. Still not working, but I feel I'm close. What's wrong? – McKracken Aug 03 '16 at 15:00
  • instead of 0.0.0.0:50011 , can you try to listen on IP:50011? – sterin jacob Aug 03 '16 at 15:27
  • I tried listening on the local IPs of the machines (e.g. 192.168.1.55:50011, 192.168.1.93:50012), and also putting the IP of the network there (e.g. 10.0.0.1:50011, 10.0.0.1:50012). Of course, in the second case the datanodes did not start. In the first case, nothing changed. Still not able to access the datanodes remotely. Does everything work over TCP? I opened and forwarded ports for TCP. Wondering if it should be UDP. – McKracken Aug 03 '16 at 16:04
  • @McKracken ., any luck so far? – sterin jacob Aug 04 '16 at 13:15
  • Not yet. I could check that the client can communicate with the datanodes, and that's ok. Ports are open and forwarded. What I'm not sure about is whether the datanodes can locate the client back (know its address), given my network configuration. I'm gonna run some tcpdump tests on both sides to understand it. The final guess is that it might depend on the `hdfs-site.xml` I'm loading on the client side in the configuration for the `FileSystem` class (I did not report it in the code, I should edit it). Maybe it's misleading the communication somehow? – McKracken Aug 04 '16 at 13:54
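
Not confirmed in the thread above, but a client-side setting that is sometimes relevant in NAT/port-forwarding setups like this one is `dfs.client.use.datanode.hostname`: it tells the HDFS client to connect to DataNodes by the hostname they advertise rather than by their internal IP, so that externally resolvable or forwarded names can be used. A speculative sketch, applied to the `hConf` object from the question's code:

    // Speculative client-side tweak (not verified in this thread): connect to
    // DataNodes by hostname instead of the internal IP the NameNode reports,
    // so externally resolvable / forwarded names can be used.
    hConf.setBoolean("dfs.client.use.datanode.hostname", true);
    fs = FileSystem.get(URI.create("hdfs://10.0.0.1:54310/"), hConf);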