Proper way to read binary file in C++?

Question

I have been searching the internet for a way to read binary files in C++, and I have found two snippets that kind of work:

No.1:

#include <cstdio>
#include <iostream>
#include <fstream>

int main(int argc, const char *argv[])
{
   if (argc < 2) {
      ::std::cerr << "Usage: " << argv[0] << " <filename>\n";
      return 1;
   }
   ::std::ifstream in(argv[1], ::std::ios::binary);
   while (in) {
      char c;
      in.get(c);
      if (in) {
         // ::std::cout << "Read a " << int(c) << "\n";
         printf("%X ", c);
      }
   }
   return 0;
}

Result:

6C 1B 1 FFFFFFDC F FFFFFFE7 F 6B 1

No.2:

#include <stdio.h>
#include <iostream>

using namespace std;

// An unsigned char can store 1 byte (8 bits) of data (0-255)
typedef unsigned char BYTE;

// Get the size of a file
long getFileSize(FILE *file)
{
    long lCurPos, lEndPos;
    lCurPos = ftell(file);
    fseek(file, 0, 2);
    lEndPos = ftell(file);
    fseek(file, lCurPos, 0);
    return lEndPos;
}

int main()
{
    const char *filePath = "/tmp/test.bed";
    BYTE *fileBuf;          // Pointer to our buffered data
    FILE *file = NULL;      // File pointer

    // Open the file in binary mode using the "rb" format string
    // This also checks if the file exists and/or can be opened for reading correctly
    if ((file = fopen(filePath, "rb")) == NULL)
        cout << "Could not open specified file" << endl;
    else
        cout << "File opened successfully" << endl;

    // Get the size of the file in bytes
    long fileSize = getFileSize(file);

    // Allocate space in the buffer for the whole file
    fileBuf = new BYTE[fileSize];

    // Read the file in to the buffer
    fread(fileBuf, fileSize, 1, file);

    // Now that we have the entire file buffered, we can take a look at some binary information
    // Let's take a look in hexadecimal
    for (int i = 0; i < 100; i++)
        printf("%X ", fileBuf[i]);

    cin.get();
    delete[] fileBuf;
    fclose(file);   // Almost forgot this
    return 0;
}

Result:

6C 1B 1 DC F E7 F 6B 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A1 D 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The result of xxd /tmp/test.bed:

0000000: 6c1b 01dc 0fe7 0f6b 01                   l......k.

The result of ls -l /tmp/test.bed

-rw-rw-r-- 1 user user 9 Nov  3 16:37 test.bed

The second method gives the right hex codes at the beginning but seems to get the file size wrong; the first method messes up the bytes.

These methods look very different. Are there many ways to do the same thing in C++? Is there an idiom that pros adopt?

Here is some more explanation on this question: http://stackoverflow.com/questions/22054759/stdistreambuf-iteratorchar-initializes-with-no-arguments/22054961?noredirect=1#22054961 — qed, Feb 26 '14 at 22:41

score 1 · Accepted Answer · edited Nov 04 '13 at 11:42

You certainly want to convert the char objects to unsigned char before processing them as integer values! The problem is that char may be signed, in which case negative values get converted to negative ints when you cast them. Negative ints displayed as hex will have more than two hex digits, the leading ones probably all "f".
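To make the sign extension concrete, here is a minimal sketch. The helper name `hex_byte` is hypothetical (it is not from the answer); it formats one `char` as two hex digits and goes through `unsigned char` first, so bytes above 0x7f do not turn into negative ints.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format one byte as two lowercase hex digits.
std::string hex_byte(char c)
{
    std::ostringstream out;
    // Convert to unsigned char first: where char is signed, a byte such as
    // 0xDC is negative, and int(c) would be formatted as ffffffdc.
    const int value = static_cast<unsigned char>(c);
    out << std::hex << std::setfill('0') << std::setw(2) << value;
    return out.str();
}
```

The value is stored in an `int` before it is streamed because streaming an `unsigned char` directly would print it as a character rather than as a number.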

I didn't immediately spot why the second approach gets the size wrong. However, the C++ approach to read a binary file is simple:

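#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <vector>

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <filename>\n";
        return 1;
    }
    const char* name = argv[1];

    std::vector<unsigned char> bytes;
    {
        // The braces limit the stream's lifetime: the file is closed at the closing brace.
        std::ifstream in(name, std::ios_base::binary);
        bytes.assign(std::istreambuf_iterator<char>(in >> std::noskipws),
                     std::istreambuf_iterator<char>());
    }
    std::cout << std::hex << std::setfill('0');
    // Iterate as int so each byte is formatted as a number, not as a character.
    for (int v: bytes) {
        std::cout << std::setw(2) << v << ' ';
    }
}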

edited Nov 04 '13 at 11:42

qed

answered Nov 03 '13 at 22:24

Dietmar Kühl

It's a little confusing that c++ uses bitwise shift operators to specify an option. – qed Nov 03 '13 at 22:42
I find that code quite verbose. And, it permanently modifies the output format of `std::cout`. How would the code look if the formatting were restored? – Roland Illig Nov 03 '13 at 22:46
It looks like `ios_base` is a subclass of `ios`, what's the difference between `ios::binary` and `ios_base::binary`? Or, maybe I got it wrong, `ios` is a subclass of `ios_base` and hence inherits `binary`? – qed Nov 03 '13 at 22:57
@RolandIllig: Given that the output of an entire binary file is quite beefy, I'd think something like `std::ostream fmt(0); fmt.copyfmt(std::cout); ...; std::cout.copyfmt(fmt);` could be reasonable. Since formatting flags are a local facility I don't think restoring them is necessary (stdio doesn't even have a concept of sticky formatting flags). – Dietmar Kühl Nov 03 '13 at 22:58
@qed: The various flags are defined in `std::ios_base`, which is inherited by `std::basic_ios<...>` (the type `std::ios` is a `typedef` for `std::basic_ios<char>`). I like to use the names of the classes where the entities are actually defined. If nothing else, it avoids opening the documentation in places where the entities are not defined (`std::ios` -> see `std::basic_ios<...>` -> see `std::ios_base` => two steps omitted by pointing directly at `std::ios_base`). – Dietmar Kühl Nov 03 '13 at 23:01
@DietmarKühl Why do you surround the 2nd and the 3rd statements with braces? – qed Nov 03 '13 at 23:02
@DietmarKühl: I don't believe that the formatting flags are a “local facility”. I rather think they are in effect until some other formatting overrides them. – Roland Illig Nov 03 '13 at 23:05
@RolandIllig: Sure, they are (the only formatting flag which is different is `width()` which gets reset to `0` upon each use). However, I find I'm better off setting the formatting flags needed/desired at the point of use. As a result, I don't bother with restoring flags which were previously set: I'd just restore a setup which is likely to be somewhat random anyway and which is bound to be changed before being reused anyway. – Dietmar Kühl Nov 03 '13 at 23:08
@RolandIllig: Actually, thinking a bit harder about the resetting of formatting flags: I guess, I wouldn't really deal with `copyfmt()` (although it would work). Instead, if I'm not allowed to change the formatting flags, I'd just use a temporary stream: `std::ostream tmpout(out.rdbuf()); tmpout << std::hex << std::setfill('0'); /* use tmpout */` The destination is the same as `out` but the formatting is kept entirely separate. – Dietmar Kühl Nov 03 '13 at 23:11
@DietmarKühl: so you want to tell me that `std::hex` is “local” to something? Local to what? It certainly affects any later output. – Roland Illig Nov 03 '13 at 23:18
@RolandIllig: As said: yes, the formatting flags set will stay the way they are. However, if you care about how things are formatted, you'd better set the formatting flags locally to the specific needs: the formatting flags already set are some random combination which was useful wherever they were last set. The fact that the formatting flags are sticky is a by-product of how formatting flags are implemented in IOStreams; it isn't intended that you set up flags globally and use them! You set the formatting flags according to your local needs. – Dietmar Kühl Nov 03 '13 at 23:27
bytes is a vector of unsigned char, but in the for loop you said `int v: bytes`, why? – qed Nov 04 '13 at 13:58
I see, it's for formatting, but why can't we just print the unsigned char? – qed Nov 04 '13 at 14:00
@qed: printing `unsigned char`s would format the values as characters rather than as integers. That is, `std::cout << static_cast<unsigned char>('A');` still prints `A` rather than 65. – Dietmar Kühl Nov 04 '13 at 14:07
What if I only want to read the first 3 bytes of a file? It would be very inefficient to iterate over the whole file. – qed Nov 04 '13 at 17:53
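On the formatting-flag discussion in the comments above: here is a minimal sketch of the temporary-stream idea, under the assumption that the caller's stream must keep its own formatting. The helper name `dump_hex` is hypothetical. It writes through a second `std::ostream` that shares the destination's buffer, so `std::hex` and the fill character never touch the original stream.

```cpp
#include <iomanip>
#include <ostream>
#include <vector>

// Hypothetical helper: hex-dump bytes to `out` without changing out's
// formatting flags.
void dump_hex(std::ostream& out, const std::vector<unsigned char>& bytes)
{
    // A second stream object on the same buffer: same destination,
    // separate formatting state.
    std::ostream tmpout(out.rdbuf());
    tmpout << std::hex << std::setfill('0');
    for (int v : bytes) {
        tmpout << std::setw(2) << v << ' ';
    }
}
```

After a call such as `dump_hex(std::cout, bytes);`, a later `std::cout << 255` should still print in decimal, because only `tmpout` was switched to hex.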

Matteo Italia · Answer 2 · 2013-11-04T02:36:53.030

Both your methods are a strange mix of C and C++ (well, actually the second is just plain C). Still, the first method is mostly right, but you have to convert c to unsigned char before printing it; otherwise, where char is signed, any byte over 0x7f is read as negative, which results in that wrong output.¹

To do things correctly and in the "C++ way", you should have done:

// needs #include <iomanip>
std::cout << std::hex << std::setfill('0');

...

   if (in)
      std::cout << std::setw(2) << int(static_cast<unsigned char>(c)) << "\n";

The second one gets the "signedness" correct, but it's mostly just C. A quick fix would be to replace the hard-coded 100 in the for loop with fileSize; as written, the loop prints 100 values no matter how long the file is, so it reads past the end of the 9-byte buffer. But in general, loading the whole file into memory just to dump its contents in hexadecimal is a poor idea; what you normally do is read the file a piece at a time into a fixed-size buffer and convert it as you go.

¹ The no-argument get returns an int, and the get(c) overload used here stores the same byte into a char; a byte bigger than 0x7f does not fit in a signed char and typically ends up as a negative value. When that char is passed to printf it gets sign-extended (any integer argument narrower than int passed to a variadic function is promoted to int) and is then interpreted as an unsigned int because of the %X conversion. (All this assumes 2's complement arithmetic, non-signaling integer overflow and a signed char.)
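The "piece at a time" approach mentioned above could look roughly like the following sketch. The function name `hex_dump_chunked` and the 4096-byte chunk size are assumptions made for illustration, not part of the answer. Note that it sets `std::hex` and the fill character on `out` and leaves them set.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>

// Hypothetical helper: hex-dump `in` to `out` one fixed-size chunk at a
// time, so memory use does not depend on the file size.
// Returns the number of bytes dumped.
std::size_t hex_dump_chunked(std::istream& in, std::ostream& out)
{
    char buffer[4096];  // chunk size is an arbitrary choice
    std::size_t total = 0;
    out << std::hex << std::setfill('0');
    // read() sets failbit on a short final chunk, so decide by gcount()
    // rather than by the stream state whether there is data to process.
    while (in.read(buffer, sizeof buffer), in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            // Go through unsigned char to avoid sign extension.
            out << std::setw(2)
                << int(static_cast<unsigned char>(buffer[i])) << ' ';
        }
        total += static_cast<std::size_t>(n);
    }
    return total;
}
```

A typical call would open the file with `std::ifstream in(path, std::ios::binary);` and then run `hex_dump_chunked(in, std::cout);`.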

edited Nov 04 '13 at 02:36

answered Nov 03 '13 at 22:30

Matteo Italia

Although you can certainly use `setf()` to set the various flags for the fields, I think it is _a lot_ easier to use the respective manipulators, e.g., `std::cout << std::hex;`. It is almost easier to use the manipulator with function-call notation, `std::cout.operator<<(std::hex)`, than to use `setf()`. – Dietmar Kühl Nov 03 '13 at 23:05
@DietmarKühl: I tend to avoid stream manipulators; I never got my head around which of them are "sticky", and I've been bitten several times by this stuff. Now that I look it up, finally [it seems that all of them are sticky](http://stackoverflow.com/questions/1532640/which-iomanip-manipulators-are-sticky), but `width` gets reset at random, go figure. These are the reasons why, if possible, I try to steer off std streams altogether; their design is riddled with flaws, especially of the "badly implemented ambitious ideas" category. – Matteo Italia Nov 04 '13 at 02:27
(#1 sin: hoping that the `<<` syntax with manipulators can be a decent replacement for printf-style, "format string" formatting; hint: it's not) – Matteo Italia Nov 04 '13 at 02:34

score 0 · Answer 3 · answered Nov 03 '13 at 22:24

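In the first case you're printing a char (which is signed on your platform), while in the second case you're printing an unsigned char. When passed to printf, both are promoted to int; a negative char is sign-extended in the process, so %X shows it as FFFFFFDC instead of DC, and that causes the difference.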

answered Nov 03 '13 at 22:24

dnk

qed · Answer 4 · 2013-11-06T22:51:36.503

While looking into why @Roland Illig's answer (now deleted) does not work, I found the following solution. I'm not sure it's up to professional standards, but it has given the right results so far, and it lets you inspect just the first n bytes of a file:

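#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char *argv[])
{
    if (argc < 3) {
        ::std::cerr << "usage: " << argv[0] << " <filename> <nbytes>\n";
        return 1;
    }

    // Note: std::stoi throws if argv[2] is not a number.
    const int nbytes = std::stoi(argv[2]);
    if (nbytes <= 0) {
        ::std::cerr << "nbytes must be positive\n";
        return 1;
    }
    // std::vector instead of `char buffer[nbytes]`: variable-length arrays
    // are a compiler extension, not standard C++.
    std::vector<char> buffer(nbytes);

    std::ifstream readingFile(argv[1], std::ios::binary);
    readingFile.read(buffer.data(), nbytes);
    // gcount() is the number of bytes actually read, which may be fewer
    // than requested if the file is shorter.
    const std::streamsize bytesread = readingFile.gcount();
    if (bytesread > 0) {
        for (std::streamsize i = 0; i < bytesread; i++) {
            // Go through unsigned char so bytes above 0x7f are not sign-extended.
            const unsigned char rawchar = static_cast<unsigned char>(buffer[i]);
            std::printf("%02x ", static_cast<unsigned>(rawchar));
        }
        std::printf("\n");
    }

    return 0;
}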

Another answer I got from wibit.com:

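#include <cstdio>
#include <iostream>
#include <fstream>
using namespace std;

int main(int argc, const char* argv[])
{
  if (argc < 2) {
    cerr << "usage: " << argv[0] << " <filename>\n";
    return 1;
  }
  ifstream inBinaryFile;
  inBinaryFile.open(argv[1], ios_base::binary);
  // get() returns an int: a byte value in 0..255, or EOF at end of file or on error
  int currentByte = inBinaryFile.get();
  while(currentByte != EOF)
  {
    printf("%02x ", currentByte);
    currentByte = inBinaryFile.get();
  }
  printf("\n");
  inBinaryFile.close();
  return 0;
}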
