
I have written some binary image data to a Hadoop SequenceFile and would like to write it out as a PNG outside of Hadoop, if possible, using Java.

[Edited] Overview of the data flow: Input files → Generate BufferedImages from input → Convert BufferedImages into binary arrays → Store as SequenceFile in HDFS → Trying to take the SequenceFile outside of HDFS and convert it into PNG.

However, I am not sure of how to locate where the data starts inside the SequenceFile. From what I have seen of the SequenceFile documentation, I can use the sync marker to locate the end of the SequenceFile header, and then use the record length and key length information to find the beginning of the value.

However, I am unsure of how to find where the sync marker is. How would I find where the header's metadata stops and where the sync marker begins and ends? Would it be possible for me to calculate the value of the sync marker and look for it that way? Also, how can I find out the number of bytes the record length and key length take up?

If there are alternative ways of finding the SequenceFile value, please let me know. If it helps, here is a little bit of code that I used to write to the SequenceFile.

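// Needs: java.io.ByteArrayOutputStream, javax.imageio.ImageIO,
// org.apache.hadoop.fs.Path, org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}
baos = new ByteArrayOutputStream();
ImageIO.write(img, "png", baos); // img is a BufferedImage; encodes it as PNG into baos
byte[] imBytes = baos.toByteArray(); // the complete PNG file contents
// The option factories are static methods on SequenceFile.Writer
writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path(imgPath)),
    SequenceFile.Writer.keyClass(Text.class),
    SequenceFile.Writer.valueClass(BytesWritable.class));
writer.append(new Text(imgPath), new BytesWritable(imBytes)); // key = path, value = PNG bytes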

Essentially, I took a BufferedImage generated by the program, encoded it as a PNG into a byte array, then appended that array to a SequenceFile as the value.

[Edit] I've looked through the SequenceFile source code and found a method called getSync(). I think it is private, though, so I'm not sure how I would use it.

dcs
  • So what's wrong with using `SequenceFile.Reader` (as seen in your link)? Or check this SO question: https://stackoverflow.com/questions/7560515/how-can-i-inspect-a-hadoop-sequencefile-for-which-i-lack-full-schema-information – mazaneicha Jul 14 '20 at 15:28
  • @mazaneicha Would I be able to use the SequenceFile.Reader to convert the SequenceFile into a PNG and store it on HDFS? – dcs Jul 14 '20 at 15:44
  • Not sure if this clarifies anything, but essentially what I was trying to do was extract image data stored in a binary file (which had a bunch of stuff other than the image data as well) and store it as a PNG in HDFS. Then from what I looked up, it seemed like you couldn't directly write to a PNG in HDFS, so I decided to write to a SequenceFile instead, and have a Java program process the SequenceFile into PNG. My issue though is that I wasn't sure how to process it, and when I looked at the SequenceFile I wasn't sure where the actual value/image data started. – dcs Jul 14 '20 at 15:51
  • A SequenceFile is just a sequence of key/value pairs where the value is the "original" file content. You can read it as text or binary, and you can also query the key and value types if they are not known upfront... It's all in that API you were referring to. – mazaneicha Jul 14 '20 at 16:03
  • @mazaneicha Ok so I'm starting to see why my question is confusing and a little circular. I think I might be asking the wrong question here. Basically, I want to write to a PNG, and I've been able to write a PNG locally but not to HDFS, so I put the data into a SequenceFile. However, I still need the PNG, because I have another program written in pure Java that operates on the PNG. I initially wanted to have a Java program to run after my Hadoop job to extract the SequenceFile value and output it as a PNG, but in retrospect I still wouldn't be able to write it directly to HDFS. – dcs Jul 14 '20 at 16:22
  • I do still think it would be helpful for my understanding to know how to find the value in SequenceFile though. I've been looking into the source code and I see that there are variables sync and private method getSync(), though I'm not sure if I would be able to use that. – dcs Jul 14 '20 at 16:24
  • Why were you not able to write a PNG to HDFS? Maybe it's better to tackle the original problem. – mazaneicha Jul 14 '20 at 16:42
  • HDFS doesn't seem to support directly writing to a PNG, from what I've read, which is why many say to use SequenceFiles instead. https://stackoverflow.com/questions/20627165/how-to-write-buffered-image-on-hdfs and https://stackoverflow.com/questions/16546040/store-images-videos-into-hadoop-hdfs – dcs Jul 14 '20 at 17:17
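
On the original question of where the value starts: going by the file-format description in the SequenceFile Javadoc, the header of an uncompressed version 6 file is the magic bytes SEQ plus a version byte, the key and value class names, two compression flags, the metadata block, and then a 16-byte sync marker. The marker is generated by the writer for each file, so it has to be read from the header rather than calculated. Each record then begins with a 4-byte record length and a 4-byte key length, and a record length of -1 signals that a sync marker follows. Below is a minimal sketch of that walk in plain Java with no Hadoop dependency. The class and method names (SeqFileSketch, readRecords) are made up for illustration, and it assumes Text keys, BytesWritable values, and no compression, as in the question's writer code. SequenceFile.Reader remains the supported way to read these files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: walks an uncompressed SequenceFile by hand.
final class SeqFileSketch {
    static final int SYNC_SIZE = 16;   // sync marker length in bytes
    static final int SYNC_ESCAPE = -1; // "record length" value that announces a sync marker

    // Decodes Hadoop's variable-length integer, the length prefix used by Text.
    static long readVLong(ByteBuffer in) {
        byte first = in.get();
        if (first >= -112) return first;                    // small values fit in one byte
        boolean negative = first < -120;
        int extra = negative ? -120 - first : -112 - first; // number of bytes that follow
        long value = 0;
        for (int i = 0; i < extra; i++) value = (value << 8) | (in.get() & 0xFF);
        return negative ? ~value : value;
    }

    // Reads a length-prefixed UTF-8 string as written by Text.
    static String readText(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readVLong(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Consumes the header and returns the sync marker; the buffer is left at the first record.
    static byte[] readHeader(ByteBuffer in) {
        byte[] magic = new byte[4];
        in.get(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q' || magic[3] != 6)
            throw new IllegalArgumentException("expected a version 6 SequenceFile");
        readText(in); // key class name
        readText(in); // value class name
        boolean compressed = in.get() != 0;
        boolean blockCompressed = in.get() != 0;
        if (compressed || blockCompressed)
            throw new UnsupportedOperationException("sketch handles uncompressed files only");
        int metadataEntries = in.getInt();
        for (int i = 0; i < metadataEntries; i++) { readText(in); readText(in); }
        byte[] sync = new byte[SYNC_SIZE];
        in.get(sync);
        return sync;
    }

    // Returns key -> value bytes, assuming Text keys and BytesWritable values.
    static Map<String, byte[]> readRecords(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file); // big-endian by default, like DataOutput
        byte[] sync = readHeader(in);
        Map<String, byte[]> records = new LinkedHashMap<>();
        while (in.hasRemaining()) {
            int recordLength = in.getInt(); // serialized key bytes + serialized value bytes
            if (recordLength == SYNC_ESCAPE) {
                byte[] marker = new byte[SYNC_SIZE];
                in.get(marker);
                if (!Arrays.equals(marker, sync))
                    throw new IllegalStateException("sync marker mismatch");
                continue;
            }
            int keyLength = in.getInt();
            int keyStart = in.position();
            String key = readText(in);
            in.position(keyStart + keyLength);    // the value starts right after the key
            byte[] value = new byte[in.getInt()]; // BytesWritable: 4-byte length, then payload
            in.get(value);
            in.position(keyStart + recordLength); // move to the next record
            records.put(key, value);
        }
        return records;
    }
}
```

After copying the file out of HDFS (for example with hdfs dfs -get), something like `Files.write(Paths.get("out.png"), SeqFileSketch.readRecords(Files.readAllBytes(Paths.get("images.seq"))).values().iterator().next())` would dump the first image, where both file names are placeholders. This loads the whole file into memory, which is fine for inspecting a small file but not a good fit for large ones.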

0 Answers