Under what circumstances might C read() fail in the middle of a loop - after having started in the loop correctly?

Question

I just encountered this problem, and unfortunately, I cannot produce a reproducible example. I'll try to provide some debug data that I can see, and I hope I can get some help, primarily with the following:

Is my conclusion that read is failing sound?
If so, given the debug data, under what circumstances, would read fail?

The context is: I am building an ARM 32-bit binary as an .elf file under MINGW64 on Windows. I also made a user-space "inspector" .c program, which I compile for Windows with the MINGW64 gcc compiler; it helps me list the sections in the .elf file and retrieve some data from there - most of the heavy lifting is done by the https://github.com/TheCodeArtist/elf-parser/ library.

So, I've been happily compiling new code in my .elf, and inspecting it with my elf-parser program, no problem for days.

Suddenly, I made a change in the .elf code, where I simply grouped some global variables into a struct - from that point on, the elf-parser program started failing - reporting null addresses.

I've made builds of the .elf before (pre_project.elf) and after (post_project.elf) this code change, and confirmed that the inspector user-space program - or rather, the elf-parser library - works fine on pre_project.elf, but fails on post_project.elf; both of these files contain 33 .elf sections.

Looking deeper, I found that the original point of failure is the read_section_header_table in elf-parser.c; and I've added the following printout there:

void read_section_header_table(int32_t fd, Elf32_Ehdr eh, Elf32_Shdr sh_table[])
{
    uint32_t i;

    assert(lseek(fd, (off_t)eh.e_shoff, SEEK_SET) == (off_t)eh.e_shoff);

    for(i=0; i<eh.e_shnum; i++) {
        assert(read(fd, (void *)&sh_table[i], eh.e_shentsize)
                == eh.e_shentsize);
        printf("  i %d {sh_name = %d, sh_type = %d, sh_flags = %d, sh_addr = %d, sh_offset = %d, sh_size = %d, sh_link = %d, sh_info = %d, sh_addralign = %d, sh_entsize = %d}\r\n",
            i, sh_table[i].sh_name, sh_table[i].sh_type, sh_table[i].sh_flags, sh_table[i].sh_addr, sh_table[i].sh_offset, sh_table[i].sh_size, sh_table[i].sh_link, sh_table[i].sh_info, sh_table[i].sh_addralign, sh_table[i].sh_entsize
        );
    }

}

The function that I've used in my code, that eventually calls this function, is basically taken from main() in elf-parser-main.c; what they do there before calling this function is:

        sh_tbl = malloc(eh.e_shentsize * eh.e_shnum);
        if(!sh_tbl) {
            printf("Failed to allocate %d bytes\n",
                    (eh.e_shentsize * eh.e_shnum));
        }
        read_section_header_table(fd, eh, sh_tbl);

And I've checked that the eh.e_shentsize * eh.e_shnum is correct in both cases (section header size for 32-bit ELF is 40 bytes, and these files have 33 sections, so 1320 bytes), and malloc allocation does not trigger error - so that part should be fine.

Now, first, I've confirmed with readelf, that indeed both *.elf files are parseable by the usual tools:

$ arm-none-eabi-readelf -WS pre_project.elf | grep '^There\|^  \[[23]'
There are 33 section headers, starting at offset 0xff440:
  [20] .debug_info       PROGBITS        00000000 0219fa 04b346 00      0   0  1
  [21] .debug_abbrev     PROGBITS        00000000 06cd40 00c04c 00      0   0  1
  [22] .debug_loc        PROGBITS        00000000 078d8c 029b72 00      0   0  1
  [23] .debug_aranges    PROGBITS        00000000 0a2900 002208 00      0   0  8
  [24] .debug_ranges     PROGBITS        00000000 0a4b08 007570 00      0   0  8
  [25] .debug_line       PROGBITS        00000000 0ac078 0306da 00      0   0  1
  [26] .debug_str        PROGBITS        00000000 0dc752 00c469 01  MS  0   0  1
  [27] .debug_frame      PROGBITS        00000000 0e8bbc 005ed0 00      0   0  4
  [28] .stab             PROGBITS        00000000 0eea8c 00006c 0c     29   0  4
  [29] .stabstr          STRTAB          00000000 0eeaf8 0000e3 00      0   0  1
  [30] .symtab           SYMTAB          00000000 0eebdc 00b7f0 10     31 2229  4
  [31] .strtab           STRTAB          00000000 0fa3cc 004ee9 00      0   0  1
  [32] .shstrtab         STRTAB          00000000 0ff2b5 00018a 00      0   0  1

$ arm-none-eabi-readelf -WS build/post_project.elf | grep '^There\|^  \[[23]'
There are 33 section headers, starting at offset 0xff4e8:
  [20] .debug_info       PROGBITS        00000000 0219fa 04b3fc 00      0   0  1
  [21] .debug_abbrev     PROGBITS        00000000 06cdf6 00c05d 00      0   0  1
  [22] .debug_loc        PROGBITS        00000000 078e53 029b72 00      0   0  1
  [23] .debug_aranges    PROGBITS        00000000 0a29c8 002208 00      0   0  8
  [24] .debug_ranges     PROGBITS        00000000 0a4bd0 007570 00      0   0  8
  [25] .debug_line       PROGBITS        00000000 0ac140 0306da 00      0   0  1
  [26] .debug_str        PROGBITS        00000000 0dc81a 00c44a 01  MS  0   0  1
  [27] .debug_frame      PROGBITS        00000000 0e8c64 005ed0 00      0   0  4
  [28] .stab             PROGBITS        00000000 0eeb34 00006c 0c     29   0  4
  [29] .stabstr          STRTAB          00000000 0eeba0 0000e3 00      0   0  1
  [30] .symtab           SYMTAB          00000000 0eec84 00b7f0 10     31 2229  4
  [31] .strtab           STRTAB          00000000 0fa474 004ee9 00      0   0  1
  [32] .shstrtab         STRTAB          00000000 0ff35d 00018a 00      0   0  1

So, all looks good there.

Anyway, running `inspector.exe --elf-file pre_project.elf` results in this printout near the end of the loop:

$ inspector.exe --elf-file pre_project.elf
...
  i 24 {sh_name = 329, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 674568, sh_size = 30064, sh_link = 0, sh_info = 0, sh_addralign = 8, sh_entsize = 0}
  i 25 {sh_name = 343, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 704632, sh_size = 198362, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 0}
  i 26 {sh_name = 355, sh_type = 1, sh_flags = 48, sh_addr = 0, sh_offset = 902994, sh_size = 50281, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 1}
  i 27 {sh_name = 366, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 953276, sh_size = 24272, sh_link = 0, sh_info = 0, sh_addralign = 4, sh_entsize = 0}
  i 28 {sh_name = 379, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 977548, sh_size = 108, sh_link = 29, sh_info = 0, sh_addralign = 4, sh_entsize = 12}
  i 29 {sh_name = 385, sh_type = 3, sh_flags = 0, sh_addr = 0, sh_offset = 977656, sh_size = 227, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 0}
  i 30 {sh_name = 1, sh_type = 2, sh_flags = 0, sh_addr = 0, sh_offset = 977884, sh_size = 47088, sh_link = 31, sh_info = 2229, sh_addralign = 4, sh_entsize = 16}
  i 31 {sh_name = 9, sh_type = 3, sh_flags = 0, sh_addr = 0, sh_offset = 1024972, sh_size = 20201, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 0}
  i 32 {sh_name = 17, sh_type = 3, sh_flags = 0, sh_addr = 0, sh_offset = 1045173, sh_size = 394, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 0}
...

... and all looks good -- however, running the program on the post_project.elf file results in:

$ inspector.exe --elf-file post_project.elf
...
  i 24 {sh_name = 329, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 674768, sh_size = 30064, sh_link = 0, sh_info = 0, sh_addralign = 8, sh_entsize = 0}
  i 25 {sh_name = 343, sh_type = 1, sh_flags = 0, sh_addr = 0, sh_offset = 704832, sh_size = 198362, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 0}
  i 26 {sh_name = 355, sh_type = 1, sh_flags = 48, sh_addr = 0, sh_offset = 903194, sh_size = 50250, sh_link = 0, sh_info = 0, sh_addralign = 1, sh_entsize = 1}
  i 27 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
  i 28 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
  i 29 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
  i 30 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
  i 31 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
  i 32 {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 0, sh_offset = 0, sh_size = 0, sh_link = 0, sh_info = 0, sh_addralign = 0, sh_entsize = 0}
...

... and later on, these null offset addresses cause segfaults/corruption.

So, once the read_section_header_table function hits section index i==27 (seemingly .debug_frame), the read(fd, (void *)&sh_table[i], eh.e_shentsize) basically reads all zeroes (and writes them) into the sh_table[i] structure(s) - and mind you, this does not trip the assert that wraps it, so the system considers proper 40 bytes to have been read in these calls as well!

And note also that the reads for post_project.elf before index 27 actually look quite reasonable (say for i==26, sh_offset = 903194 = 0xdc81a, the same offset reported by readelf for the same file) ?!

The only way I can describe this so far, is basically read failing in the middle of reading a file ?!?!

I've never experienced anything like this - so I'm really wondering under what possible conditions would read here fail, considering that:

If elf-parser library was all that wrong in pointer arithmetic, it should have failed also on pre_project.elf, which it didn't (and in fact, ran fine for days).
If post_project.elf itself was corrupt as an ELF file - then readelf should not have been able to process it either, and it does
Maybe post_project.elf was on disk with a corrupt sector - but I tried copying both it and the executable to several different paths on my system, and they all result in the same read failure (and a corrupt sector would have tripped readelf too)

The only thing I can see as a possible reason here is that - considering that read is a syscall, so in principle it "asks" the OS, here Windows, for data - maybe Windows somehow flagged post_project.elf as a virus or something, then we start reading, then once Windows realizes something is reading this file, it stops delivering data?! But shouldn't that have resulted in a read failure at least? (and plus why flag an .elf file as a virus - it's not even a Windows executable?!)

Consult [the man page](https://man7.org/linux/man-pages/man2/read.2.html) and modify the code to verify that `read()` actually is failing, and check `errno` to learn _why_. Before that you are only guessing that it is the true failure point. — Dúthomhas, Oct 05 '22 at 02:31
Never, *ever* put code in an assertion that cannot safely be altogether omitted. Because that code might actually **be** omitted, depending on how you compile the program. Generally, that implies that among the things you must not use assertions for is checking runtime error conditions. Assertions are for checking your program's invariants. If an assertion ever fails, it means that your program is wrong -- either an invariant on which your code relies is not actually satisfied, or the assertion itself is wrong. — John Bollinger, Oct 05 '22 at 03:02
Also, an assertion failure produces a rather uninformative program termination. It's much more helpful to terminate with a diagnostic message that gives some idea of what went wrong. For failures that result in `errno` being set (which is not all of them -- check function documentation), the `perror()` function is a convenient way to emit such a diagnostic. After that, you can `exit(1)` or `abort()` or whatever -- even attempt to recover. — John Bollinger, Oct 05 '22 at 03:12
In answer to the question you asked: `read` can basically fail if (1) you hand it a bad (not open) file descriptor to read from, (2) you hand it a bad address to write to, or (3) there's an i/o error. Inspecting `errno` after read fails can help you differentiate these cases. Additionally, there are many situations under which `read` can return fewer characters than you asked for, and you may consider this an "error" also (although usually it's not). — Steve Summit, Oct 05 '22 at 03:12
Since you know the offsets into the file, did you look into it with a hex file viewer? Just to make sure that it really does not contain zeroes... — the busybee, Oct 05 '22 at 06:33
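Following the comments above, here is a minimal sketch of how the read() call could be checked without assert(), so that a short read, an early end of file, and a real error are each reported explicitly. The helper name `read_exact` is hypothetical and not part of elf-parser; this is one possible shape under those assumptions, not the library's own approach.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of elf-parser): read exactly `len` bytes
 * from `fd` into `buf`. Returns 0 on success, -1 on error or early EOF. */
int read_exact(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before any data: retry */
            fprintf(stderr, "read failed: %s\n", strerror(errno));
            return -1;
        }
        if (n == 0) {                   /* read() reported end of file */
            fprintf(stderr, "unexpected EOF after %lu of %lu bytes\n",
                    (unsigned long)done, (unsigned long)len);
            return -1;
        }
        done += (size_t)n;              /* short read: keep going */
    }
    return 0;
}
```

In read_section_header_table, each loop iteration could then call `read_exact(fd, &sh_table[i], eh.e_shentsize)` and stop with a diagnostic on a nonzero return. Because the call is no longer wrapped in assert(), it still runs when the program is compiled with NDEBUG defined.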

score 0 · Answer 1 · edited Oct 06 '22 at 11:16

One circumstance where C read() might fail in the middle of a loop after having started correctly - and the one relevant to the problem described in the OP - is when the file was opened with open() under MINGW64 on Windows: apparently, for open() there, the "initial default setting is text mode", where "a byte of 0x1A could be interpreted as EOF".

This is in spite of the common understanding that "the POSIX read()/write() functions are implicitly binary", and the implication that MINGW64 emulates those (even if by calling Windows APIs under the hood to do that).
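As a worked check against the readelf listings in the question: the .debug_str header (index 26) has file offset 0x0dc81a in post_project.elf but 0x0dc752 in pre_project.elf. Assuming the section header fields are stored little-endian (which the correctly decoded values on a Windows host suggest), the post value is stored as the bytes 1a c8 0d 00, so a 0x1A byte sits inside header 26. That is consistent with the table reading correctly up to that header and returning zeroes afterwards. The helper name below is hypothetical; it only illustrates the byte arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the little-endian encoding of `value`
 * contains the byte 0x1A (Ctrl-Z), 0 otherwise. */
int le32_contains_ctrl_z(uint32_t value)
{
    int i;

    for (i = 0; i < 4; i++) {
        /* Extract byte i, counting from the least significant byte. */
        if (((value >> (8 * i)) & 0xFFu) == 0x1Au)
            return 1;
    }
    return 0;
}
```

With the two offsets from the listings, `le32_contains_ctrl_z(0x000dc81a)` is 1 (post_project.elf) and `le32_contains_ctrl_z(0x000dc752)` is 0 (pre_project.elf).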

So, if read() in a loop (on a file plainly open()ed under MINGW64, and thus opened in text mode) encounters the byte 0x1A in the file stream, it returns the number of bytes read up to that byte - and not the full number of bytes requested (i.e. a "short read"). Therefore, the fix is to "change the default translation mode directly by setting the global variable _fmode in your program", or in my case:

    _fmode = _O_BINARY; // set file mode to binary before open(); fixes "short read" in a loop if byte 0x1A (EOF) is encountered
    fd = open(elf_filename, O_RDONLY|O_SYNC);
    if(fd<0) {
        printf("Error %d Unable to open %s\n", fd, elf_filename);
        return;
    }

With this, there are no more short reads encountered, and thus no more "zeroes" in the middle of the stream as described in the OP.
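An alternative to changing the global _fmode is to request binary mode only for the descriptor being opened, by passing O_BINARY to open(). The sketch below assumes the MINGW64 headers define O_BINARY and falls back to 0 on systems that have no text mode; the helper name `count_bytes_binary` is hypothetical and only serves to show that every byte, including 0x1A, is delivered.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef O_BINARY
#define O_BINARY 0      /* assumption: no text mode here, so the flag is a no-op */
#endif

/* Hypothetical helper: open `path` in binary mode and count the bytes that
 * read() delivers until end of file. Returns -1 on any error. */
long count_bytes_binary(const char *path)
{
    char buf[256];
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY | O_BINARY);

    if (fd < 0) {
        fprintf(stderr, "Unable to open %s: %s\n", path, strerror(errno));
        return -1;
    }
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += (long)n;               /* accumulate every chunk delivered */
    if (n < 0) {
        fprintf(stderr, "read failed on %s: %s\n", path, strerror(errno));
        total = -1;
    }
    close(fd);
    return total;
}
```

For a file opened this way, the returned count should match the file size even when the file contains 0x1A bytes. The per-descriptor flag also avoids changing the mode of every other file the program opens.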
