
I'm kind of new to Hadoop HDFS and quite rusty with Java, and I need some help. I'm trying to read a file from HDFS and calculate its MD5 hash. The general Hadoop configuration is shown below.

private FSDataInputStream hdfsDIS;
private FileInputStream FinputStream;
private FileSystem hdfs;
private Configuration myConfig;

myConfig = new Configuration();
// Use Path so the files are loaded from the local filesystem;
// addResource(String) would look them up on the classpath instead
myConfig.addResource(new Path("/HADOOP_HOME/conf/core-site.xml"));
myConfig.addResource(new Path("/HADOOP_HOME/conf/hdfs-site.xml"));

hdfs = FileSystem.get(new URI("hdfs://NodeName:54310"), myConfig);

hdfsDIS = hdfs.open(hdfsFilePath);

The function hdfs.open(hdfsFilePath) returns an FSDataInputStream

The problem is that I can only get an FSDataInputStream out of HDFS, but I'd like to get a FileInputStream.

The code below performs the hashing and is adapted from something I found on Stack Overflow (I can't seem to find the link to it now).

FileInputStream FinputStream = hdfsDIS;   // <--- This is where the problem is
MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");
    FileChannel channel = FinputStream.getChannel();
    ByteBuffer buff = ByteBuffer.allocate(2048);

    while (channel.read(buff) != -1) {
        buff.flip();
        md.update(buff);
        buff.clear();
    }
    byte[] hashValue = md.digest();

    return toHex(hashValue);
}
catch (NoSuchAlgorithmException e) {
    return null;
}
catch (IOException e) {
    return null;
}

The reason I need a FileInputStream is that the hashing code uses a FileChannel, which supposedly increases the efficiency of reading the data from the file.

Could someone show me how I could convert the FSDataInputStream into a FileInputStream?

Irvin H.
  • Have you at least *tried* just hashing with the existing stream? You say that using `FileChannel` "supposedly" increases the efficiency - have you tested that and found that you actually need any performance improvement potentially gained? – Jon Skeet Sep 30 '13 at 16:53
  • Actually, I've not tried using the "existing stream", but I think I might as well try it to see if it works. By the way, I found the link to where I got the code from (http://stackoverflow.com/a/9322214/2105711) – Irvin H. Sep 30 '13 at 18:33
  • 1
    Ok, so i've used the existing stream, read from it, calculated the hash and it works. Perhaps i can forget about the supposed "performance improvement" for now. Though hopefully someday someone might come up with the solution for using NIO FileChannels that i was looking for. (If it is seen to be useful for large files, that is) – Irvin H. Sep 30 '13 at 20:41

3 Answers


Use it as an InputStream:

MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");  
    byte[] buff = new byte[2048];
    int count;

    while((count = hdfsDIS.read(buff)) != -1){
        md.update(buff, 0, count);
    }
    byte[] hashValue = md.digest();

    return toHex(hashValue);
}
catch (NoSuchAlgorithmException e){
    return null;
} 
catch (IOException e){
    return null;
}

the code that does the hashing uses a FileChannel which supposedly increases the efficiency of reading the data from the file

Not in this case. A FileChannel only improves efficiency if you're just copying the data to another channel, or if you use a direct ByteBuffer. If you're processing the data, as here, it doesn't make any difference. A read is still a read.
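As a further aside (not part of the original answer): since plain reads are all you need, the read-and-update loop can also be collapsed with java.security.DigestInputStream, which updates the digest as a side effect of reading. The sketch below hashes an in-memory stream so it runs without a Hadoop cluster; with HDFS you would pass hdfsDIS instead, since FSDataInputStream is just an InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Demo {

    // Hashes everything read from the stream. Works for any InputStream,
    // including FSDataInputStream; an in-memory stream is used in main()
    // so the example is self-contained.
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buff = new byte[2048];
            while (dis.read(buff) != -1) {
                // the digest is updated as a side effect of each read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(md5Hex(in));  // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```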

user207421
  • Yes, this is more or less what I did in avoiding the `FileChannel` as was suggested earlier. Thanks also for the further explanation. – Irvin H. Oct 04 '13 at 22:45

You can use the FSDataInputStream as just a regular InputStream, and pass that to Channels.newChannel to get back a ReadableByteChannel instead of a FileChannel. Here's an updated version:

InputStream inputStream = hdfsDIS;
MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");  
    ReadableByteChannel channel = Channels.newChannel(inputStream);
    ByteBuffer buff = ByteBuffer.allocate(2048);

    while(channel.read(buff) != -1){
        buff.flip();
        md.update(buff);
        buff.clear();
    }
    byte[] hashValue = md.digest();

    return toHex(hashValue);
}
catch (NoSuchAlgorithmException e){
    return null;
} 
catch (IOException e){
    return null;
}
Joe K

You can't do that assignment because of the class hierarchy:

java.lang.Object
  extended by java.io.InputStream
    extended by java.io.FilterInputStream
      extended by java.io.DataInputStream
        extended by org.apache.hadoop.fs.FSDataInputStream

FSDataInputStream is not a FileInputStream.

That said, to convert from an FSDataInputStream to a FileInputStream, you could use the FSDataInputStream's FileDescriptor to create a FileInputStream, according to the API:

new FileInputStream(hdfsDIS.getFileDescriptor());

I'm not sure it will work, though.

Ricardo