Extract multiple JPEGs from a single file

Question

I have hundreds of files that have some data at the top and then a series of images at the bottom. I need to read this data using either C# or VB and then write the individual images to a file. Here is an example of what the file looks like in Notepad++: https://i.stack.imgur.com/DOEJE.png

I need to read all the data at top as well as the images. Any help or examples would be appreciated.

Do you know the boundaries of the specific files? e.g. are there always 12 lines of text? — Sebb, Jan 26 '15 at 22:07
I do not, but i do know that each images starts with ЄЂ Ŧ @ ,Ѐ Vjpeg appl Ё, H H Ō Photo - JPEG ؿJFIF H H ﾀAppleMark Ā — Devon Quick, Jan 26 '15 at 22:12

score 1 · Accepted Answer · edited May 23 '17 at 10:26

The right approach depends on the file structure. The simplest case is that you know the boundaries of each section in the file, or the file stores some binary data that gives the actual length of each section. IMO it would be much better to store the text as binary data rather than line by line like a normal text file. The BinaryReader / BinaryWriter classes (from System.IO) solve this problem best. The exception is when all the sections in your file have static sizes: then you can just use File.ReadAllBytes() and simply copy the bytes of each section out of the resulting byte array. If the sections have dynamic sizes, you might want to use something like this:

using (var fs = new FileStream("yourfile.bin", FileMode.Open))
{
    using (var br = new BinaryReader(fs))
    {
        int sections = br.ReadInt32();
        for (int i = 0; i < sections; i++)
        {
            int sectionLength = br.ReadInt32();
            byte[] sectionData = br.ReadBytes(sectionLength);

            // Use the data however you want ...
            // A good idea would be to check whether it's text or an image
        }
    }
}

This corresponds to the following file structure:

4 bytes (int) for the number of sections

Each section has the following structure:

4 bytes (int) for the section length, OR 8 bytes (long) if the images are big
byte[] DataBytes (either the bytes of text or the bytes of an image)

The same goes for writing to the file: every time you write data, you write the size of the data before the data itself. This approach is also safer in the end, because the reader never has to guess where a section ends.
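
A minimal sketch of the writing side, assuming the layout described above (a 4-byte section count followed by length-prefixed sections). The helper names WriteSections and ReadSections are hypothetical, and the layout is one you define yourself, not the layout of an existing third-party file. The example uses top-level statements (C# 9 or later) and round-trips through a MemoryStream so it runs without any files on disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: writes the section count, then each section as
// a 4-byte length followed by the section's raw bytes.
void WriteSections(Stream output, IReadOnlyList<byte[]> sections)
{
    // leaveOpen: true so the caller keeps ownership of the stream.
    using (var bw = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
    {
        bw.Write(sections.Count);        // 4 bytes (int): number of sections
        foreach (byte[] section in sections)
        {
            bw.Write(section.Length);    // 4 bytes (int): section length
            bw.Write(section);           // the section's bytes
        }
    }
}

// Hypothetical helper: reads the same layout back into memory.
List<byte[]> ReadSections(Stream input)
{
    var result = new List<byte[]>();
    using (var br = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
    {
        int count = br.ReadInt32();
        for (int i = 0; i < count; i++)
        {
            int length = br.ReadInt32();
            result.Add(br.ReadBytes(length));
        }
    }
    return result;
}

// Round trip through memory; a FileStream can be passed in the same way.
var stream = new MemoryStream();
WriteSections(stream, new[]
{
    Encoding.UTF8.GetBytes("some text"),
    new byte[] { 0xFF, 0xD8, 0xFF }      // placeholder bytes, not a real image
});
stream.Position = 0;
List<byte[]> roundTrip = ReadSections(stream);
Console.WriteLine(roundTrip.Count);      // prints 2
```

Writing the length before the data is what lets the reader call ReadBytes with an exact size instead of searching for a delimiter.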

Note: You could validate the data either by checking whether it has an image header or by creating your own kind of data header, e.g. 1 or 2 bytes for the type. I'd suggest 2 bytes to have proper padding. This could be an enum like the following:

enum DataType : short
{
    Text = 0,
    Image = 1
}

Then, before reading the section data, you read the type like this:

var type = (DataType)br.ReadInt16();

This also makes it easy to extend the file structure with new kinds of data. For example, you could store things other than text and images, such as audio files, videos, or other binary files.

If you know nothing about the data except perhaps that the images have image headers, you might just want to compare the bytes and check for matching image headers. This may or may not work, since image headers can differ and you have no exact knowledge of the image data stored. If you actually read some of the header and work out the image boundaries, you can figure out how many bytes to read. How to do that differs by image type (JPG, PNG, GIF, etc.). You could take a look at this: Getting image dimensions without reading the entire file
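
A rough sketch of the header-matching idea for JPEGs, assuming each embedded image begins with the JPEG start-of-image marker (bytes FF D8 FF) and ends with the end-of-image marker (bytes FF D9). FindJpegs and ExtractJpegs are hypothetical helper names. The scan treats every FF D8 FF as the start of a new image, so it will mis-split JPEGs that contain an embedded thumbnail, and it has not been checked against the file format from the question. The example uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: returns (offset, length) for each byte range that
// starts with FF D8 FF and ends at the last FF D9 before the next start.
List<(int Offset, int Length)> FindJpegs(byte[] data)
{
    // Collect every offset where the start-of-image marker occurs.
    var starts = new List<int>();
    for (int i = 0; i + 2 < data.Length; i++)
    {
        if (data[i] == 0xFF && data[i + 1] == 0xD8 && data[i + 2] == 0xFF)
            starts.Add(i);
    }

    var images = new List<(int Offset, int Length)>();
    for (int s = 0; s < starts.Count; s++)
    {
        int start = starts[s];
        int limit = s + 1 < starts.Count ? starts[s + 1] : data.Length;

        // Search backwards from the limit for the end-of-image marker.
        for (int j = limit - 2; j >= start + 2; j--)
        {
            if (data[j] == 0xFF && data[j + 1] == 0xD9)
            {
                images.Add((start, j + 2 - start));
                break;
            }
        }
    }
    return images;
}

// Hypothetical helper: writes each candidate image to outputDir.
int ExtractJpegs(string inputPath, string outputDir)
{
    byte[] data = File.ReadAllBytes(inputPath);
    Directory.CreateDirectory(outputDir);
    int written = 0;
    foreach (var (offset, length) in FindJpegs(data))
    {
        var image = new byte[length];
        Array.Copy(data, offset, image, 0, length);
        File.WriteAllBytes(Path.Combine(outputDir, $"image{written}.jpg"), image);
        written++;
    }
    return written;
}

// Demo on synthetic bytes: two text bytes, then two tiny fake "images".
byte[] sample =
{
    0x41, 0x42,
    0xFF, 0xD8, 0xFF, 0xE0, 0x01, 0xFF, 0xD9,
    0x00,
    0xFF, 0xD8, 0xFF, 0xE0, 0x02, 0x03, 0xFF, 0xD9,
    0x00
};
foreach (var (offset, length) in FindJpegs(sample))
    Console.WriteLine($"offset {offset}, length {length}");
// prints "offset 2, length 7" and then "offset 10, length 8"
```

For a real file you would call ExtractJpegs with the input path and an output folder. Everything before the first reported offset is the non-image data at the top of the file, which still has to be parsed separately.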

Good answer :) This explains it pretty much, so I'll just note [this thread about detecting jpg's](http://stackoverflow.com/questions/772388/c-sharp-how-can-i-test-a-file-is-a-jpeg), which states that the magic number of jpg's is `0xd8ffe0ff;`. So you'd need to read byte by byte and search for this number. If you have multiple images per file, you could also load the first after detecting it and then use its size as offset. — Sebb, Jan 26 '15 at 22:29
Thanks for the response! i will give this a try. I would have definitely designed the data differently if it were me but this is a competitor software that i am converting data from so i need to work with it. :) — Devon Quick, Jan 26 '15 at 23:08
I get the following exception at int sectionLength = br.ReadInt32(); System.IO.EndOfStreamException was unhandled HResult=-2147024858 Message=Unable to read beyond the end of the stream. — Devon Quick, Jan 27 '15 at 15:11
You're trying to read 1 - 4 bytes out of the file. Make sure you're not reading out of the file's boundaries. Also make sure at whatever offset you're reading is actually a 32bit signed integer. Look at this https://msdn.microsoft.com/en-us/library/system.io.filestream.position%28v=vs.110%29.aspx — Bauss, Jan 27 '15 at 18:25
