
I have a matrix (a 2-D int pointer, int **mat) that I am trying to write to a file on Linux in little-endian convention.

Here is my function that writes to the file:

#define BUFF_SIZE 4
void write_matrix(int **mat, int n, char *dest_file) {
    int i, j;
    char buff[BUFF_SIZE];
    int fd = open(dest_file, O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR | S_IXUSR);

    if (fd < 0) {
        printf("Error: Could not open the file \"%s\".\n", dest_file);
    }

    buff[0] = (n & 0x000000ff);
    buff[1] = (n & 0x0000ff00) >> 8;
    buff[2] = (n & 0x00ff0000) >> 16;
    buff[3] = (n & 0xff000000) >> 24;

    write(fd, buff, BUFF_SIZE);

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            buff[0] = (mat[i][j] & 0x000000ff);
            buff[1] = (mat[i][j] & 0x0000ff00) >> 8;
            buff[2] = (mat[i][j] & 0x00ff0000) >> 16;
            buff[3] = (mat[i][j] & 0xff000000) >> 24;

            if (write(fd, buff, BUFF_SIZE) != BUFF_SIZE) {
                close(fd);
                printf("Error: could not write to file.\n");
                return;
            }
        }
    }

    close(fd);
}

The problem is that when I write out a large enough matrix of the form mat[i][i] = i (let's say 512 x 512), I think I get an overflow, since I get weird negative numbers when I read it back.

To convert back I use:

void read_matrix(int fd, int **mat, int n, char buff[]) {
    int i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            assert(read(fd, buff, BUFF_SIZE) == BUFF_SIZE);
            mat[i][j] = byteToInt(buff);
        }
    }
}

int byteToInt(char buff[]) {
    return (buff[3] << 24) | (buff[2] << 16) | (buff[1] << 8) | (buff[0]);
}

What am I doing wrong?

EDITED:

  1. Added the read_matrix function.

  2. It seems like I'm getting a short instead of an int, since 384 (binary 110000000) comes back as -128 (binary 10000000).

  3. Did a test, and found out that:

    char c = 128; int i = 0; i |= c;

    gives i = -128. Why????

  • please show an example of calling this; how is mat set up? – Jasen Dec 25 '14 at 01:26
  • Most likely not the problem, but your error checking is weird. You write to the file before checking to see if fd is less than zero. – FDinoff Dec 25 '14 at 01:26
  • @FDinoff you are right about the check for successful opening... Fixing! – Tomer Amir Dec 25 '14 at 01:31
  • Show us how you wrote the "read matrix" part as well (the entire thing). – Rufflewind Dec 25 '14 at 01:31
  • It seems like everything bigger than 384 is overflowing – Tomer Amir Dec 25 '14 at 01:33
  • It may not apply in this case, but it's safer to shift first, _THEN_ mask, to avoid sign-extension effects. It also makes the code more readable since you're always extracting the lowest byte, and a bit more compact since you don't need the additional constants. – keshlam Dec 25 '14 at 01:54
  • If you're spamming raw octets in/out, I strongly advise you to use `unsigned char`, not `char`, for your buffers. The last thing you need is an integer promotion from a `char` that is signed when you didn't intend it. – WhozCraig Dec 25 '14 at 02:04
  • @keshlam I tried it, but it didn't work either. I think I found the problem, but I still can't explain/fix it. See the edit please... – Tomer Amir Dec 25 '14 at 02:11
  • Regarding your update, `char c = 128` is already implementation-defined if `char` is 8 bits and signed. The valid range for `char` (assuming yours is signed, and I have no evidence to suggest otherwise) is `-128 .. 127`; you're overflowing your `char` data type. 128 is `0x80`, which is `-128` as a signed char if your implementation so chooses (and it did). That value is then promoted to `int` for the `i |= c` calculation and sign-extended, giving you `i = -128` for your final result. I meant what I said earlier: don't use `char` for this stuff, use `unsigned char`. – WhozCraig Dec 25 '14 at 02:22
  • Don't do the endian transform by hand; it's redundant: http://man7.org/linux/man-pages/man3/endian.3.html. And your code is extremely error-prone: `int` is not necessarily 4 bytes long. – Jason Hu Dec 25 '14 at 02:27
  • @WhozCraig +1 what the OP has is not only implementation-defined behaviour but also undefined behaviour, due to left shifting of signed, negative values. – legends2k Dec 25 '14 at 03:17
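As the last two comments note, on Linux the endian conversion doesn't have to be done by hand, and fixed-width types avoid the assumption that int is 4 bytes. Here is a minimal sketch of one element's round trip using the glibc <endian.h> macros (the helper names write_le32 and read_le32 are illustrative, not from any library):

#include <endian.h>   /* htole32, le32toh -- glibc-specific header */
#include <stdint.h>
#include <unistd.h>

/* Write one 32-bit value to fd in little-endian byte order. */
int write_le32(int fd, uint32_t v) {
    uint32_t le = htole32(v);   /* a no-op on little-endian hosts */
    return write(fd, &le, sizeof le) == (ssize_t)sizeof le ? 0 : -1;
}

/* Read one little-endian 32-bit value back into host byte order. */
int read_le32(int fd, uint32_t *v) {
    uint32_t le;
    if (read(fd, &le, sizeof le) != (ssize_t)sizeof le)
        return -1;
    *v = le32toh(le);
    return 0;
}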

2 Answers


The problem is in your input conversion:

int byteToInt(char buff[]) {
    return (buff[3] << 24) | (buff[2] << 16) | (buff[1] << 8) | (buff[0]);
}

You don't mention which platform you are on, but on most common platforms char is signed. And that will cause problems. Suppose, for example, that buff[1] is 0x80 (0b10000000). Since char is signed, that is the representation of the value -128. And since the shift operators start by performing the integer promotions on both of their operands, it will be converted to the int -128 before the shift is performed; in other words, it will have the value 0xFFFFFF80, which will become 0xFFFF8000 after the shift.

The bitwise logical operators (such as |) perform the usual arithmetic conversions before doing the bitwise operation; in the case of (buff[1] << 8) | (buff[0]), the left-hand operand will already be a signed int (because the type of a shift expression is the promoted type of its left-hand operand); the right-hand operand, an implicitly signed char, will also be promoted to a signed int, so again if it were 0x80, it would end up sign-extended to 0xFFFFFF80.

In either case, the bitwise-or operation will end up with unwanted high-order 1 bits.
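The second failure mode is easy to reproduce in isolation. A minimal test, assuming char is signed and 8 bits wide (as it is on x86 Linux):

#include <stdio.h>

int main(void) {
    char c = (char)0x80;  /* bit pattern 10000000, i.e. -128 when char is signed */
    int i = 0;
    i |= c;               /* c is promoted to int with sign extension before the | */
    printf("%d 0x%08x\n", i, (unsigned)i);   /* typically prints: -128 0xffffff80 */
    return 0;
}

This is exactly the effect reported in the question's edit.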

Explicitly casting buff[x] to an unsigned int won't help, because it will first be sign-extended to an int before being reinterpreted as an unsigned int. Instead, it is necessary to cast it to an unsigned char:

int byteToInt(char buff[]) {
    return   ((unsigned char)buff[3] << 24)
           | ((unsigned char)buff[2] << 16)
           | ((unsigned char)buff[1] << 8)
           | (unsigned char)buff[0];
}

Since int may be 16-bit, it would be better to use long, and indeed it would be better to use unsigned long to avoid other conversion issues. That means doing a double cast:

unsigned long byteToInt(char buff[]) {
    return   ((unsigned long)(unsigned char)buff[3] << 24)
           | ((unsigned long)(unsigned char)buff[2] << 16)
           | ((unsigned long)(unsigned char)buff[1] << 8)
           | (unsigned long)(unsigned char)buff[0];
}
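A quick sanity check of the round trip, pairing the question's own encoding on the write side with the corrected function above (384 is the value the question reported going wrong):

#include <assert.h>

unsigned long byteToInt(char buff[]);   /* the corrected version above */

int main(void) {
    char buff[4];
    int n = 384;                    /* 0x180 */
    buff[0] = (n & 0x000000ff);     /* 0x80: the byte that used to sign-extend */
    buff[1] = (n & 0x0000ff00) >> 8;
    buff[2] = (n & 0x00ff0000) >> 16;
    buff[3] = (n & 0xff000000) >> 24;
    /* Passes with the cast version; the original signed-char version
       yields -128 instead, matching the question's observation. */
    assert(byteToInt(buff) == 384);
    return 0;
}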
rici
  • `(unsigned char)buff[3] << 24` is not portable. `unsigned char` promotes to `int`, and C specifies `int` as at _least_ 16 bits; if `int` were 16-bit, then shifting by 24 would be undefined. But then the OP appears to be assuming (wrongly) that `int` is at least 32 bits. – chux - Reinstate Monica Dec 25 '14 at 03:50
  • @chux: I believe it's theoretically non-portable even if `int` is 32-bit, because there is no guarantee that left shifting a positive signed int will result in the expected negative signed int (although the result is unspecified, not undefined.) But that swamp is too deep to wade in on a holiday. And do have a happy one :) – rici Dec 25 '14 at 05:34
  • From the swampy abyss the ooze drips from a head nodding and murmuring "Yes, shift UB, muddy - very muddy" as he wanders back to his lair and pins up his stocking. (nice hat) – chux - Reinstate Monica Dec 25 '14 at 06:06

What you have is undefined behaviour that is often overlooked: left shifting of negative signed values is undefined (C11 §6.5.7).

When you do this

int byteToInt(char buff[]) {
    return (buff[3] << 24) | (buff[2] << 16) | (buff[1] << 8) | (buff[0]);
}

even if one element of buff holds a negative value (i.e. the MSB of that byte of the binary data is set), you hit undefined behaviour. Since your data is binary, reading it as unsigned makes the most sense. You could use a standard type that makes the signedness and width explicit, such as uint8_t from stdint.h.
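For example, here is a sketch of the conversion with explicit fixed-width types (this swaps the question's char buffer for uint8_t, so the call sites would need the same change):

#include <stdint.h>

/* Reassemble a little-endian 32-bit value from raw bytes. uint8_t never
   sign-extends, and the uint32_t casts keep every shift well-defined
   even on platforms where int is only 16 bits. */
uint32_t byteToInt(const uint8_t buff[4]) {
    return  (uint32_t)buff[0]
          | ((uint32_t)buff[1] << 8)
          | ((uint32_t)buff[2] << 16)
          | ((uint32_t)buff[3] << 24);
}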

legends2k