1

How do I read a binary file line by line and store the contents in C?

I have a binary file called add_nums.mi.

I can inspect the contents of the file by running the command

xxd -c 4 add_nums.mi | head -n 9

This prints the following binary contents:

00000000: 1301 f07f  ....
00000004: ef00 c000  ....
00000008: b717 0000  ....
0000000c: 2386 0780  #...
00000010: b717 0000  ....
00000014: 1307 8004  ....
00000018: 2380 e780  #...
0000001c: 1305 0000  ....
00000020: 6780 0000  g...

How would I print this out programmatically in C?

This is my attempt so far, but it does not work correctly:

#include <stdio.h>

int main() {
    FILE *fp;
    unsigned char ch;

    fp = fopen("examples/add_2_numbers/add_2_numbers.mi", "rb");
    if (fp == NULL) {
        printf("Error opening file\n");
        return 1;
    }

    while ((ch = fgetc(fp)) != EOF) {  // read file character by character
        printf("%c", ch);
    }

    fclose(fp);
    return 0;
}
chqrlie
  • Please define _"it does not work correctly"_. Also: you're printing binary data with `%c`: not all bytes are _printable_ characters. – Adriano Repetti Mar 16 '23 at 09:45
  • "lines" and binary files don't really go together. – Shawn Mar 16 '23 at 09:51
  • Welcome to SO. If you are dealing with binary data, you should not use character or string functions to read from it. You would do better to use `fread` to read blocks of data. – Gerhardh Mar 16 '23 at 09:51
  • Also, `fgetc` returns an `int`, not a `char` or even an `unsigned char`. How would an `unsigned char` ever be able to properly hold `EOF`, which is a negative value? – Gerhardh Mar 16 '23 at 09:52
  • Anyway, you cannot print non-printable characters with `%c`, only characters in the range 32 to 126 (assuming you're on a system with ASCII encoding, which is almost certainly (99.99%) the case). For all non-printable characters, print a dot `'.'`. – Jabberwocky Mar 16 '23 at 10:15
  • Are you trying to read the binary file to reproduce the output of the `xxd` command, or is there some data in the file you want to read and use? To interpret the contents of a binary file, you need to know the format of the data in it. What is the format of the binary file? – Eric Postpischil Mar 16 '23 at 10:36
  • Use `fread()` instead of `fgetc()`. – dimich Mar 16 '23 at 10:42
  • `"How do I read a binary file line by line"` -- Please explain what you mean by the word "line". In the context of a text file, a line is a sequence of characters delimited by a newline (`'\n'`) character. Is that also what you mean when you use the word "line" in the context of a binary file? Or do you maybe mean a fixed number of bytes? – Andreas Wenzel Mar 16 '23 at 12:22
  • Note that reading a binary file, if the file contains the bytes from a `struct` of variables, can be done by reading and copying the bytes right into a packed struct. Then, you just access the members of the struct to read the values from them. It would be the opposite of the 3 answers I present [here](https://stackoverflow.com/q/69983795/4561887), for example. Here is a link to just my 3rd answer: [Answer 3/3: use a packed struct and a raw `uint8_t` pointer to it](https://stackoverflow.com/a/69984614/4561887). Just learn those concepts and then go in the opposite direction. – Gabriel Staples Mar 17 '23 at 23:20

2 Answers

0

As has been mentioned in the comments, when reading binary data, you need to know the exact format of the data.

One way or another, you need to read individual bytes making up each data field, and reassemble those bytes into an actual value.

For example, if the data consists solely of 16-bit (2-byte) integers in "big-endian" byte order, you could use something like this:

while (1) {
    unsigned char c1 = getc(fp);
    unsigned char c2 = getc(fp);
    if(feof(fp)) break;
    int x = (c1 << 8) | c2;
    printf("%04x\t%d\n", x, x);
}

This code literally reads two bytes c1 and c2, and assembles them into an integer value x, prints it, and repeats. When I run this I get:

1301    4865
f07f    61567
ef00    61184
c000    49152
b717    46871
....

and the connection with the xxd dump is obvious.

If the integers are little-endian, the change is simple:

int x = (c2 << 8) | c1;

This gives

0113    275
7ff0    32752
00ef    239
...

and you can see that the bytes have been swapped.
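The two byte orders can also be written as small helper functions that work on bytes already held in memory. This is only a sketch: the names `decode_be16` and `decode_le16` are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; the names are illustrative only. */

/* Combine two bytes stored most-significant byte first (big-endian). */
static uint16_t decode_be16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Combine two bytes stored least-significant byte first (little-endian). */
static uint16_t decode_le16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(((unsigned)p[1] << 8) | p[0]);
}
```

With the first two bytes of the dump, `13 01`, `decode_be16` yields 0x1301 and `decode_le16` yields 0x0113, matching the two listings above.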

If the 16-bit integers are signed, and if type int on your machine is bigger than 16 bits, you may need to perform some explicit sign extension. That might look like this:

if(x & 0x8000) x |= 0xffff0000;
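The one-liner above relies on `int` being exactly 32 bits wide. A sketch that avoids that assumption subtracts 0x10000 instead of setting the high bits; the helper name `sign_extend16` is hypothetical, and the data is assumed to be two's complement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reinterpret a raw 16-bit pattern (0..0xffff)
   as a signed two's-complement value. */
static int32_t sign_extend16(uint16_t raw)
{
    int32_t x = raw;

    if (x & 0x8000)      /* sign bit set: the value is negative */
        x -= 0x10000;    /* map 0x8000..0xffff onto -32768..-1 */
    assert(x >= -32768 && x <= 32767);
    return x;
}
```

For example, the raw value 0xf07f (61567 when treated as unsigned) becomes -3969, while 0x0113 stays 275.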

Finally, if the file contains 32-bit (4-byte) integers in little-endian byte order, you could use something more like

while (1) {
    unsigned char c1 = getc(fp);
    unsigned char c2 = getc(fp);
    unsigned char c3 = getc(fp);
    unsigned char c4 = getc(fp);
    if(feof(fp)) break;
    int x = (((((c4 << 8) | c3) << 8) | c2) << 8) | c1;
    printf("%x\t%d\n", x, x);
}

(This code is arguably imperfect in that it quietly assumes that type int is 32 bits.)

There are a number of other details and subtleties to get right, but these code fragments should at least get you started.
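One of those details is error handling. As a sketch of an alternative, the loop can be built on `fread()` and fixed-width types, so that end of file, a short final word, and a read error all stop the loop the same way. The helper names `decode_le32`, `read_le32`, and `dump_le32_words` are hypothetical, and the little-endian layout is an assumption about the file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers; the names are illustrative only. */

/* Assemble four bytes stored least-significant byte first. */
static uint32_t decode_le32(const unsigned char b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Read one 32-bit little-endian word from fp.
   Returns 1 on success, 0 on end of file, a short final word, or a read error. */
static int read_le32(FILE *fp, uint32_t *out)
{
    unsigned char b[4];

    assert(fp != NULL && out != NULL);
    if (fread(b, 1, sizeof b, fp) != sizeof b)
        return 0;
    *out = decode_le32(b);
    return 1;
}

/* Print every complete word in hex and decimal, one per line.
   Returns the number of words printed. */
static long dump_le32_words(FILE *in, FILE *out)
{
    uint32_t word;
    long count = 0;

    while (read_le32(in, &word)) {
        fprintf(out, "%08lx\t%lu\n", (unsigned long)word, (unsigned long)word);
        count++;
    }
    return count;
}
```

After opening the file with `fopen(path, "rb")`, calling `dump_le32_words(fp, stdout)` prints one word per line. For the dump in the question, the first four bytes `13 01 f0 7f` would be shown as 7ff00113. If the caller needs to tell a read error from end of file, `ferror(fp)` can be checked after the loop.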

Steve Summit
  • I'm afraid this example is not a good illustration. The code would have undefined behavior upon read errors. Why not use `fread()` or check for `EOF` the proper way? – chqrlie Mar 17 '23 at 19:41
  • Steve, I agree with @chqrlie. chqrlie is right. `getc()` returns an `int`, not an `unsigned char`, specifically so you can check it for errors first. You need to do something like this instead: `int retcode = getc(fp); if (retcode == EOF) { // handle error } unsigned char c1 = retcode;` etc. Also, `fgetc()` is preferred over `getc()` because they do the same thing but `getc()` could be written as a macro, which can have side effects if the input is an expression. See [here](https://en.cppreference.com/w/c/io/fgetc). – Gabriel Staples Mar 17 '23 at 19:50
  • Upvoting because this answer is generally useful, though. – Gabriel Staples Mar 17 '23 at 19:53
  • @GabrielStaples: `getc()` or `fgetc()` offer very similar performance on most systems now, but passing an expression with side effects as the `FILE*` argument is a recipe for disaster in both cases. – chqrlie Mar 17 '23 at 19:57
  • @GabrielStaples IMO, the chance that `getc`'s argument could have side effects is so remote that it's not worth worrying about. Also I'm not sure (but don't have time to research just now) what chqrlie means by UB — perhaps it's because getc would return `EOF` on error, but `feof` won't catch it, in which case the fix is obvious. (Also I trust it's obvious why, in this case, I'm neglecting `getc`'s error return.) – Steve Summit Mar 17 '23 at 19:57
  • I know of *one* case where `putc`'s argument might have a side effect, namely an implementation of [`tee(1)`](https://linux.die.net/man/1/tee), and if someone invokes it with multiple filename arguments. I can't imagine one for `getc`. – Steve Summit Mar 17 '23 at 19:58
  • @SteveSummit: `putc()` is a different case. C library authors are well advised to not multi-evaluate the first argument. `putc(c, file[i++])` is horrible and unnecessary: just use a classic and readable `for` loop: `for (int i = 0; i < nb_streams; i++) { putc(c, file[i]); }` – chqrlie Mar 17 '23 at 20:00
  • @SteveSummit: yes the fix is obvious: `if (feof(fp) || ferror(fp)) break;` but reading the bytes with `fread()` is simpler: `while (fread(buf, sizeof buf, 1, fp)) ...` – chqrlie Mar 17 '23 at 20:04
0

The xxd tool outputs the file contents in 4 columns:

  • the file offset in hex with at least 8 digits
  • the bytes as 2 hex digits each, grouped into 2 columns
  • the bytes as characters if printable, otherwise as a .

You can produce this output with this modified version:

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("examples/add_2_numbers/add_2_numbers.mi", "rb");
    if (fp == NULL) {
        printf("Error opening file\n");
        return 1;
    }

    int width = 4;          // number of bytes per line
    int pos = 0;            // byte number in the current line
    char line[width + 1];   // byte dump buffer
    long int offset = 0;    // file offset
    int ch;                 // byte read or EOF
    int lineno = 0;         // line number
    int max_lines = 9;      // maximum number of lines (0 for none)

    while ((ch = fgetc(fp)) != EOF) {   // loop until end of file or read error
        if (pos == 0) {
            printf("%08lx: ", offset);
        }
        offset++;
        printf("%02x", ch);
        line[pos++] = (ch >= 0x20 && ch < 0x7f) ? ch : '.';
        if (pos == width / 2) {
            printf(" ");
        }
        if (pos == width) {
            line[pos] = '\0';
            printf("  %s\n", line);
            pos = 0;
            lineno++;
            if (lineno == max_lines)
                break;
        }
    }
    if (pos > 0) {
        int pad = 2 * (width - pos) + (pos < width / 2) + 2;
        line[pos] = '\0';
        printf("%*s\n", pad + pos, line);
    }

    fclose(fp);
    return 0;
}
chqrlie