Extracting a binary file from another file: encoding/conversion mistake

Question

I have two binary files, "bigFile.bin" and "smallFile.bin".
The "bigFile.bin" contains "smallFile.bin".
Opening it in Beyond Compare confirms that.

I want to extract the smaller file from the bigger one into a "result.bin" that equals "smallFile.bin".
I have two keywords: one for the start position ("Section") and one for the end position ("Man").

I tried the following:

   byte[] bigFile = File.ReadAllBytes("bigFile.bin");
   UTF8Encoding enc = new UTF8Encoding();
   string text =  enc.GetString(bigFile);

   int startIndex = text.IndexOf("Section");
   int endIndex = text.IndexOf("Man");

   string smallFile = text.Substring(startIndex, endIndex - startIndex);

   File.WriteAllBytes("result.bin",enc.GetBytes(smallFile));

I compared the result file with the original small file in Beyond Compare, which shows a hex comparison.
Most of the bytes are equal, but some are not.

For example, the new file has 84 where the old file has the sequence EF BF BD instead.

What can cause those differences? Where am I mistaken?

score 0 · Accepted Answer · edited May 23 '17 at 11:56

Since you are working with binary files, you should not use text-related functionality (which includes encodings etc). Work with byte-related methods instead.
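
To see why a text round trip changes bytes, here is a minimal sketch (the class and method names are made up for illustration). A byte such as 0x84 is not valid UTF-8 on its own, so decoding with the default UTF8Encoding substitutes the replacement character U+FFFD, and encoding that character back produces the three bytes EF BF BD:

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes as UTF-8 and encodes the resulting string back to bytes,
    // the same round trip the code in the question performs.
    public static byte[] RoundTrip(byte[] input)
    {
        var enc = new UTF8Encoding();
        string text = enc.GetString(input);
        return enc.GetBytes(text);
    }

    public static void Main()
    {
        // 0x84 is a UTF-8 continuation byte; it is invalid without a lead byte.
        byte[] original = { 0x41, 0x84, 0x42 };
        byte[] result = RoundTrip(original);

        Console.WriteLine(BitConverter.ToString(original)); // 41-84-42
        Console.WriteLine(BitConverter.ToString(result));   // 41-EF-BF-BD-42
    }
}
```

The original byte cannot be recovered from the string, which is why arbitrary binary data should not pass through GetString/GetBytes.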

Your current code could be made to work by changing it into something like this:

   byte[] bigFile = File.ReadAllBytes("bigFile.bin");

   int startIndex = 0;            // placeholder: assume we somehow know this
   int endIndex = bigFile.Length; // placeholder: assume we somehow know this

   var length = endIndex - startIndex;
   var smallFile = new byte[length];
   Array.Copy(bigFile, startIndex, smallFile, 0, length);
   File.WriteAllBytes("result.bin", smallFile);

To find startIndex and endIndex you could even use your previous technique, but searching the byte array directly for the marker bytes would be more appropriate.
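
One way to find the markers without decoding anything is a plain byte-pattern search. The following is only a sketch: MarkerSearch and its IndexOf helper are hypothetical names, the sample data is invented, and it assumes the markers "Section" and "Man" are stored in the file as ASCII bytes:

```csharp
using System;
using System.Text;

public static class MarkerSearch
{
    // Returns the index of the first occurrence of needle in haystack at or
    // after start, or -1 if there is none. Naive search, fine for a sketch.
    public static int IndexOf(byte[] haystack, byte[] needle, int start = 0)
    {
        if (needle.Length == 0)
            throw new ArgumentException("Marker must not be empty.", nameof(needle));

        for (int i = Math.Max(start, 0); i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    public static void Main()
    {
        // Invented sample: 01 84, "Section", FF 84, "Man", 00
        byte[] bigFile =
        {
            0x01, 0x84,
            0x53, 0x65, 0x63, 0x74, 0x69, 0x6F, 0x6E, // "Section"
            0xFF, 0x84,
            0x4D, 0x61, 0x6E,                         // "Man"
            0x00
        };

        byte[] startMarker = Encoding.ASCII.GetBytes("Section");
        byte[] endMarker = Encoding.ASCII.GetBytes("Man");

        int startIndex = IndexOf(bigFile, startMarker);
        // Look for the end marker only after the start marker.
        int endIndex = startIndex < 0
            ? -1
            : IndexOf(bigFile, endMarker, startIndex + startMarker.Length);

        Console.WriteLine($"{startIndex} {endIndex}"); // 2 11
    }
}
```

The indexes found this way are byte offsets, so they can be passed straight to Array.Copy, which is not true of character indexes taken from a decoded string.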

However this would still be problematic because:

Stuffing both binary data and "text" into the same file is going to complicate matters
There is still a lot of unnecessary copying going on here; you should work with your input as a Stream rather than an array of bytes
Even worse than the unnecessary copying, any non-stream solution would either need to load all of your input file in memory as happens above (wasteful), or be exceedingly complicated to code

So, what to do?

Don't read file contents in memory as byte arrays. Work with FileStream instead.
Wrap a StreamReader around the FileStream and use it to find the markers for the start and end indexes. Even better, change your file format so that you don't need to search for text.
After you know startIndex and length, use stream functions to seek to the relevant part of your input stream and copy length bytes to the output stream.
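
The last step could be sketched as follows. StreamExtract and CopyRange are hypothetical names, and the demo uses MemoryStream with invented data so that it runs without any files; the buffer size is an arbitrary choice:

```csharp
using System;
using System.IO;

public static class StreamExtract
{
    // Seeks to startIndex in input and copies exactly length bytes to output,
    // one buffer at a time, so the whole input is never held in memory.
    public static void CopyRange(Stream input, Stream output, long startIndex, long length)
    {
        input.Seek(startIndex, SeekOrigin.Begin);

        var buffer = new byte[81920];
        long remaining = length;
        while (remaining > 0)
        {
            int toRead = (int)Math.Min(buffer.Length, remaining);
            int read = input.Read(buffer, 0, toRead);
            if (read == 0)
                throw new EndOfStreamException("Input ended before the requested range was copied.");

            output.Write(buffer, 0, read);
            remaining -= read;
        }
    }

    public static void Main()
    {
        using (var input = new MemoryStream(new byte[] { 0x01, 0x84, 0xFF, 0x10, 0x20 }))
        using (var output = new MemoryStream())
        {
            CopyRange(input, output, startIndex: 1, length: 3);
            Console.WriteLine(BitConverter.ToString(output.ToArray())); // 84-FF-10
        }
    }
}
```

With real files, the same method would be called with File.OpenRead("bigFile.bin") as the input and File.Create("result.bin") as the output.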
