
I would like to store some videos/images in Hadoop HDFS, but I have heard that HDFS accepts only text files.

To be sure, can we store videos/images in HDFS? If yes, what are the steps to follow to do that?

Peter Mortensen
devosJava

2 Answers


It is absolutely possible without doing anything extra. Hadoop provides the facility to read/write binary files, so practically anything that can be converted into bytes can be stored in HDFS (images, videos, etc.). For that, Hadoop provides something called SequenceFiles. A SequenceFile is a flat file consisting of binary key/value pairs, and it provides Writer, Reader, and Sorter classes for writing, reading, and sorting respectively. So you could convert your image/video file into a SequenceFile and store it in HDFS. Here is a small piece of code that takes an image file and converts it into a SequenceFile, where the name of the file is the key and the image content is the value:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageToSeq {
    public static void main(String args[]) throws Exception {

        // Load the cluster configuration
        Configuration confHadoop = new Configuration();
        confHadoop.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
        confHadoop.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(confHadoop);
        Path inPath = new Path("/mapin/1.png");
        Path outPath = new Path("/mapin/11.png");
        FSDataInputStream in = null;
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        SequenceFile.Writer writer = null;
        try {
            in = fs.open(inPath);
            // Size the buffer from the file's actual length; in.available()
            // is not guaranteed to equal the total number of bytes.
            byte buffer[] = new byte[(int) fs.getFileStatus(inPath).getLen()];
            IOUtils.readFully(in, buffer, 0, buffer.length);
            writer = SequenceFile.createWriter(fs, confHadoop, outPath, key.getClass(), value.getClass());
            // File name as the key, raw image bytes as the value
            writer.append(new Text(inPath.getName()), new BytesWritable(buffer));
        } catch (IOException e) {
            System.out.println("Exception MESSAGES = " + e.getMessage());
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(writer);
        }
    }
}
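
To read the images back out, the matching SequenceFile.Reader iterates over the key/value pairs. Here is a minimal sketch under the same assumptions as above (Text keys, BytesWritable values, and the /mapin/11.png output path); note that BytesWritable.getBytes() returns a padded internal buffer, so trim it to getLength() before handing the bytes to an image decoder:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqToImage {
    public static void main(String args[]) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path seqPath = new Path("/mapin/11.png"); // the SequenceFile written above
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, seqPath, conf);
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // getBytes() returns a padded internal buffer, so trim it to
                // getLength() before decoding, or image readers will choke.
                byte[] bytes = Arrays.copyOf(value.getBytes(), value.getLength());
                System.out.println(key + " : " + bytes.length + " bytes");
                // e.g. ImageIO.read(new ByteArrayInputStream(bytes)) for a PNG
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}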

And if your intention is just to dump the files as they are, you could simply do this:

bin/hadoop fs -put /src_image_file /dst_image_file
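
And to copy a stored file back out of HDFS to the local filesystem, the matching -get command does the reverse (the local destination path here is just illustrative):

bin/hadoop fs -get /dst_image_file /local_image_file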

And if your intent is more than just storing the files, you might find HIPI useful. HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.

HTH

Tariq
  • Nice example. As you know, working with Hadoop means a huge amount of data, and hence a huge number of images. I think we could iterate over a directory to read all the images and store them in HDFS? Another question: can we apply the same code to videos? Thank you – devosJava May 14 '13 at 15:03
  • I don't know if I should store them as they are or not, because I would like to apply some transformations. What do you think: leave them as they are or not? – devosJava May 14 '13 at 15:06
  • I would suggest you club multiple files into one SequenceFile and then store it. It would be more efficient, as Hadoop is good at processing a small number of big files. And it should be quite possible to do the transformations. I have never tried video files, but the process should be the same. – Tariq May 14 '13 at 15:10
  • So I should group many images into one SequenceFile. For 1,000,000 images, how many SequenceFiles would I need, for example? – devosJava May 14 '13 at 15:12
  • Well, that depends on your particular use case and a few other factors, like the size of each file. – Tariq May 14 '13 at 15:15
  • But to execute this code, should I package this class into a JAR file and execute it on Hadoop, or what should I do? Please give me the steps to follow. Thanks – devosJava May 14 '13 at 20:16
  • It's up to you whether you want to create a JAR file or run it directly from your IDE; either is fine. And since it is not a MapReduce job, you will be running it on a single machine. – Tariq May 15 '13 at 09:44
  • I prefer to package it into a JAR file (are there no other JARs that need to be added to the classpath?). – devosJava May 15 '13 at 14:54
  • @Tariq Thank you for the code to convert an image to a SequenceFile. But I am facing a problem with how to read the original images back from the SequenceFile. I read them out using `imag = ImageIO.read(new ByteArrayInputStream(file.getImage().getBytes()));`, where `file` is a custom writable, but I got the error `Error reading PNG image data`. – hakunami Apr 20 '14 at 08:50
  • Thanks for the explanation and example. How can I divide the SequenceFile into map/reduce tasks using JavaCV/OpenCV? – Tariq Jan 13 '15 at 00:30
  • Caution: `in.available()` may not be equal to the total number of bytes. – Thamme Gowda May 28 '16 at 01:22

It is entirely possible to store images and video on HDFS, but you will likely need to use or write your own custom InputFormat, OutputFormat, and RecordReader in order to split them properly.

I imagine others have undertaken similar projects, however, so if you scour the net you might find that someone has already written custom classes that do exactly what you need.
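
For illustration, here is a minimal sketch of such a setup, in the spirit of the well-known whole-file pattern: an InputFormat that refuses to split files and a RecordReader that hands each binary file to a mapper as a single BytesWritable record. The names WholeFileInputFormat and WholeFileRecordReader are illustrative, not classes that ship with Hadoop:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a binary file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // each file yields exactly one record
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to close; the input stream is closed in nextKeyValue()
    }
}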

Quetzalcoatl
  • OK, but what do you mean when you say "you will need to write your own custom InputFormat, OutputFormat and RecordReader in order to split them properly"? Split what? Thank you – devosJava May 14 '13 at 14:53
  • `InputFormat` is responsible for splitting your input image/video files up for distribution across the cluster to your mappers and reducers. You'll need to write your own, as the default `InputFormat` classes such as `FileInputFormat` are designed for text, not video or image content. – Quetzalcoatl May 14 '13 at 14:55
  • If you click through the links to the Javadoc, it has all of this information readily available; a quick Google search can find you anything else you want to know about them - that's how I learned! – Quetzalcoatl May 14 '13 at 14:58
  • I don't think one needs a custom InputFormat just to store the files in HDFS. Even a simple `bin/hadoop fs -put /src_image_file /dst_image_file` would do the trick. – Tariq May 14 '13 at 14:59
  • @Tariq Well yes, of course, but the binary would be effectively meaningless. It's a filesystem; you can store anything on it if you just treat it as binary, but if you want to process it in any way, then custom classes will be needed. It just depends whether the OP intends to merely store the files there or process them in some way. – Quetzalcoatl May 14 '13 at 15:02
  • @Quetzalcoatl: You could use SequenceFileInputFormat to process the files. I don't see the need for a custom InputFormat for this. – Tariq May 14 '13 at 15:05
  • Okay, nice examples and explanation. – devosJava May 14 '13 at 15:05
  • How could I retrieve the file? I am looking to store and retrieve binary files to check the up/down transfer time. Do I need to write MapReduce code, or can I do it just using the command? – Josh Jun 03 '14 at 18:46