
If I have a command line program with input and output like this:

    md5sum < hadoop-2.7.2.tar.gz
    c442bd89b29cab9151b5987793b94041  -

How can I run it using Hadoop? This seems to be an embarrassingly simple problem, but none of the solutions I tried produced the correct output.

Maybe I just wasn't able to follow the instructions correctly. So please explain in some detail, or at least point me to helpful documentation.

Markus Heitz
  • What exactly is your question? Which part of Hadoop are you targeting? Are you just trying to run a Linux command on data using mapreduce? – OneCricketeer May 07 '16 at 07:02
  • md5sum is only a placeholder for another program with the same interface. It expects binary input in whole files and creates text output. This will be executed on lots of files (~100000 files) many times, with slight modifications to the program settings. A cluster will be needed, and I want to use Hadoop to distribute the job. The files need to be stored in HDFS and there should be data locality. So, yes, I am just trying to run a Linux command on data using mapreduce and HDFS. – Markus Heitz May 08 '16 at 15:19
  • You could see [How to read a single file in Hadoop](http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop) followed by whatever Java code you want to run on that file (see the mapper sketch after this thread). If you don't understand the concept of mapreduce, then running and understanding a hello world example of wordcount would be good. – OneCricketeer May 08 '16 at 15:23
  • Does this work with binary files as input? Maybe I am mistaken but it looks like a text file reader to me. Then the md5sums will be wrong. – Markus Heitz May 08 '16 at 15:35
  • Mapreduce relies on the ability to create file splits of the input files to make the jobs require less memory. While you could override that behavior, it doesn't provide much benefit because then you've gone back to just iterating over files in a regular distributed filesystem. I think you might want this, though http://stackoverflow.com/a/10533275/2308683 – OneCricketeer May 08 '16 at 15:41
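
To make the comment thread concrete, here is a minimal mapper sketch along the lines suggested above. It assumes a whole-file input format that delivers each file as a single NullWritable/BytesWritable record (one such format is sketched under the answer below); the class name Md5Mapper is illustrative, and Java's built-in MessageDigest stands in for the external md5sum:

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Receives one whole file per map() call (the key is unused, the value
    // holds the raw bytes) and emits "filename -> md5 hex digest".
    public class Md5Mapper
            extends Mapper<NullWritable, BytesWritable, Text, Text> {

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            try {
                // copyBytes() trims the backing array to the actual length,
                // so the digest covers exactly the file's contents.
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(value.copyBytes());

                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));
                }

                String fileName =
                        ((FileSplit) context.getInputSplit()).getPath().getName();
                context.write(new Text(fileName), new Text(hex.toString()));
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e); // MD5 is available on standard JVMs
            }
        }
    }

For a program that cannot be reimplemented in Java, the mapper could instead spill the bytes to a local temporary file and invoke the external binary via ProcessBuilder.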

1 Answer


You might be able to use a WholeFileInputFormat together with Hadoop streaming; a sketch of such an input format follows. The problem you might run into is memory: each map task has to read its file in full. If you have a strong requirement to feed whole files to your program, you should either make sure the inputs stay at a reasonable size, or find an algorithm that works on partial input, so you can fully embrace MR's splits and scalability.
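
A minimal sketch of such a whole-file input format, following the well-known pattern of marking files non-splittable and reading them into one BytesWritable record (the new mapreduce API is assumed, and the class names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Presents each input file as exactly one record:
    // key = NullWritable, value = the file's complete contents.
    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split: a mapper must see the whole file
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            WholeFileRecordReader reader = new WholeFileRecordReader();
            reader.initialize(split, context);
            return reader;
        }

        // Reads the single file behind a (non-split) FileSplit into memory.
        public static class WholeFileRecordReader
                extends RecordReader<NullWritable, BytesWritable> {

            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false; // only one record per file
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override
            public NullWritable getCurrentKey() {
                return NullWritable.get();
            }

            @Override
            public BytesWritable getCurrentValue() {
                return value;
            }

            @Override
            public float getProgress() {
                return processed ? 1.0f : 0.0f;
            }

            @Override
            public void close() {
                // nothing to close; the stream is closed in nextKeyValue()
            }
        }
    }

In the driver you would wire it up with job.setInputFormatClass(WholeFileInputFormat.class) and a mapper like the Md5Mapper sketched under the question's comments. Note that each file is buffered in memory, so this only scales while individual files fit comfortably in a map task's heap.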

rav