Before flagging this as a duplicate, yes, I have read this:

How do you determine the size of a file in C?

I need to determine the size of files. The thing is - I need to do it portably in C. The answers in the above question use the POSIX stat function.

Also, I cannot use fseek(f, 0, SEEK_END) together with ftell(f) as suggested. According to the C11 standard this is undefined behavior.

7.21.9.2:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

The undefined behavior actually shows up on Windows (I didn't test it on Linux), and I learned this the hard way, in a bigger program. When running the same binary program on the same file, ftell() sometimes returns -1, sometimes the correct size and sometimes garbage values.

A solution I thought about is reading the entire file (until EOF), byte by byte, and counting the number of bytes read. This could work, but it has some disadvantages that I would like to not have to deal with. Firstly, read errors could occur, and this would mess up the whole calculation. Secondly, reading a file byte by byte is slow, and normally, an operating system should be able to tell you the size of a file without even looking at its contents.

Is there any way to get the size of a file portably in C11?

DarkAtom
  • 1
    You could write a function that does the Right Thing (tm) depending on if it's being compiled on Windows or POSIXish OSes, wrapping the appropriate syscall. – Shawn Apr 23 '21 at 23:50
  • @Shawn That isn't portable. Portable means that the code can run on any system that supports C11, even platforms that may get created in the future. There is no way to create `#ifdef`s for EVERY platform out there. – DarkAtom Apr 23 '21 at 23:53
  • @DarkAtom Can you give an example of where there are no `#ifdef` for OS-dependent calls within a library? – Zoso Apr 23 '21 at 23:56
  • FYI, files do not have fixed sizes, just a size they were a few seconds ago when you tried to check. Your mention of a read error relates to this: If a read error occurs reading a file, does it have a meaningful size? If it is not a thing that can be reliably read from beginning to end, it is not doing a file’s job of storing information, and the size it purportedly had is not useful. – Eric Postpischil Apr 23 '21 at 23:58
  • @Zoso Maybe I explained badly. I didn't say that you can't create `#ifdef`s for every given OS. I said that it is virtually impossible to write a program that covers all the platforms on which it could run. A portable C program could run on a platform which hasn't been created yet. There is no way to cover that with `#ifdef`s. – DarkAtom Apr 23 '21 at 23:59
  • 3
    What you want is simply not possible. There's no standard C functionality for this. Either use libraries, which have tailored OS-specific calls wrapped in their API, or write such functions yourself. – Some programmer dude Apr 23 '21 at 23:59
  • You could use a sequence of `SEEK_SET` `fseek` calls in a binary search. – Eric Postpischil Apr 24 '21 at 00:01
  • File size uses different standards depending on what machine one is on, binary _vs_ text, and maybe different sizes depending on what one is measuring exactly. – Neil Apr 24 '21 at 00:01
  • 1
    @DarkAtom I'm afraid that's the only way to go. If you are on a new platform, add a `#error` if none of the `#ifdef`/`#elif` match so that you can understand that you need an implementation for the new platform. – Zoso Apr 24 '21 at 00:01
  • @EricPostpischil reading a file can give read errors in quite a lot of cases. Maybe you are analyzing a file on a thumb drive, which gets removed while reading. It's not the file's fault, it's the user's fault. A file definitely has a fixed size which is stored in some place in the filesystem. All it should take is reading that number. – DarkAtom Apr 24 '21 at 00:03
  • 1
    And considering that Windows, macOS, Linux and other POSIX (or POSIX-ish) variants are the vast majority of systems out there (much more than 99%), do you really need "true" portability to support maybe a handful of fringe operating systems? What is the target of your program? Do you really target some tiny specialist micro-systems that are still big enough to have an OS? Do you really target big-iron specialist systems? Do you really need to cater for 0.001% (or less) of all possible systems? – Some programmer dude Apr 24 '21 at 00:03
  • 3
    @DarkAtom: A file does not always have a fixed size which is stored in some place in the file system. For example, the system has no idea where the end of the tape you mounted is. Or where the end of the deck of punched cards is. Do all network file systems provide the size to clients? – Eric Postpischil Apr 24 '21 at 00:04
  • 1
    Also, what is the *actual* problem you need to solve? Why do you need such a high degree of portability and failure-robustness? This seems very much like an [XY problem](https://en.wikipedia.org/wiki/XY_problem). – Some programmer dude Apr 24 '21 at 00:05
  • @EricPostpischil A tape is not a file, it's a device – DarkAtom Apr 24 '21 at 00:05
  • @Someprogrammerdude I just wanted to know if there is a truly portable way to do this. It seems there is not. I am ok with it, I don't need true portability, but I always look for a truly portable solution before resorting to ugly ones like preprocessor directives. – DarkAtom Apr 24 '21 at 00:07
  • 3
    @DarkAtom: A tape drive is a device. A tape is a storage medium. [Files can be stored on tape.](https://en.wikipedia.org/wiki/File_system#Tape_file_systems) I suspect this may be part of the reason that `SEEK_END` is not necessarily supported. – Eric Postpischil Apr 24 '21 at 00:07
  • @Someprogrammerdude The size of whatever data is on the tape must be known beforehand. Otherwise you don't know when to stop reading. – DarkAtom Apr 24 '21 at 00:08
  • 1
    Start and stop sequences, similar to start and stop bits of a serial port. Or like the "FULL STOP" of telegrams. Or perhaps more like the silence between tracks on old music tapes (which really was used by advanced players to distinguish between tracks). :) – Some programmer dude Apr 24 '21 at 00:13
  • `stat()` and whatever the Win32 equivalent is is going to cover 99.9% of operating systems you're likely to see your code run on. And you can always add the remnant if you actually need to. That's what configure scripts are good for; testing to see what features are available on a given system and picking the appropriate ones for it. – Shawn Apr 24 '21 at 00:15
  • Don't forget that if you get it to "work" it may still not work due to race conditions. E.g. if it reports that a file is 1234 bytes, then a different process appends 200 bytes to the end of the file, and then you assume that the file is 1234 bytes, then your code will be borked. To guard against that you need file locking (e.g. lock the file so no other process can modify it, then determine its size and act on that information, then unlock the file); but file locking is a huge mess (and not just a "not portable" mess). – Brendan Apr 24 '21 at 00:44
  • @MarcoBonelli Implementation defined behavior still implies that the behavior is well defined. But in my case it isn't. – DarkAtom Apr 24 '21 at 00:45
  • [Windows has a `_stat64` function](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/stat-functions?view=msvc-160) almost identical to the POSIX version. – dbush Apr 24 '21 at 01:00
  • @MarcoBonelli *that is not undefined behavior, it's implementation defined behavior. Quite different* No. [It's ***undefined*** behavior](https://port70.net/~nsz/c/c11/n1570.html#note268): "Setting the file position indicator to end-of-file, as with `fseek(file, 0, SEEK_END)`, has undefined behavior for a binary stream ..." – Andrew Henle Apr 24 '21 at 01:07
  • @AndrewHenle ah, of course there's a sneaky footnote 10 screens above the relevant section... wow. Should have looked into it by myself. From the section about `fseek` it is not really defined as UB. But I guess you're right, it is UB. Thanks. – Marco Bonelli Apr 24 '21 at 01:11
  • @DarkAtom Armed with a portable `getsize()`, what problem would you use it for? – chux - Reinstate Monica Apr 24 '21 at 01:31
  • And getting the size of a file is usually a TOCTOU bug anyway. You just know what the size of the file **used to be**. – Andrew Henle Apr 24 '21 at 02:28

1 Answer

As it turns out, pure C11 offers no way to query a file's size directly. There are only 2 options:

  1. Reading the file byte by byte

    #include <stdio.h>
    #include <stdint.h>
    
    int64_t fsize(const char filename[])
    {
        int64_t n = 0;
        FILE* f = fopen(filename, "rb");
        if (!f)
            return -1;
        int c = fgetc(f);
        while (c != EOF)
        {
            n++;
            c = fgetc(f);
        }
        if (ferror(f))
            n = -1;
        fclose(f);
        return n;
    }
    
  2. Preprocessor directives

    #if defined(__unix__) || defined(__APPLE__)
        #include <sys/types.h>
        #include <sys/stat.h>   /* stat() and struct stat are declared here, not in <unistd.h> */
    #elif defined(_WIN32)
        #include <windows.h>
    #else
        #error "Unknown operating system"
    #endif
    #include <stdint.h>
    
    int64_t fsize(const char filename[])
    {
    #if defined(__unix__) || defined(__APPLE__)
        /* Plain stat(): stat64 is a nonstandard extension that not every system provides. */
        struct stat st;
        if (stat(filename, &st) == 0)
            return st.st_size;
        return -1;
    #elif defined(_WIN32)
        /* GetFileSize() works with DWORD; unsigned halves also avoid sign extension below. */
        DWORD low, high, err;
        HANDLE f = CreateFileA(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (f == INVALID_HANDLE_VALUE)
            return -1;
        low = GetFileSize(f, &high);
        err = GetLastError();
        CloseHandle(f);
        if (low == INVALID_FILE_SIZE && err != NO_ERROR)
            return -1;
        return ((int64_t)high << 32) | low;
    #endif
    }
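
Option 1 does not have to go through `fgetc()` one byte at a time. As a rough sketch (the name `fsize_buffered` and the 64 KiB buffer size are arbitrary choices for illustration, not something from the answer above), the same count can be done with `fread()` in chunks. It still reads the whole file and is still exposed to read errors, it just makes far fewer library calls:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: counts the bytes of a file by reading it in chunks.
   Returns -1 if the file cannot be opened or a read error occurs. */
int64_t fsize_buffered(const char *filename)
{
    unsigned char buf[65536];   /* chunk size chosen arbitrarily */
    int64_t n = 0;
    size_t got;
    FILE *f = fopen(filename, "rb");
    if (!f)
        return -1;
    /* fread returns fewer items than requested only at EOF or on error */
    while ((got = fread(buf, 1, sizeof buf, f)) > 0)
        n += (int64_t)got;
    if (ferror(f))
        n = -1;
    fclose(f);
    return n;
}
```

It is called the same way as `fsize()`: `int64_t n = fsize_buffered("data.bin");`, where a negative result means the size could not be determined.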
    
DarkAtom
  • `char c = fgetc(f);` is wrong. [`fgetc()` returns `int`](https://port70.net/~nsz/c/c11/n1570.html#7.21.7.1) and not `char`. – Andrew Henle Apr 24 '21 at 01:04
  • @AndrewHenle You are right, I don't know how I missed that. – DarkAtom Apr 24 '21 at 01:18
  • 2
    "There are only 2 options:" --> or read a large buffer rather than `fgetc()`, or binary search with `fseek()`, or ... – chux - Reinstate Monica Apr 24 '21 at 01:29
  • chux is correct; I tested a binary search implementation on macOS 10.14.6: Use `fseek` with `SEEK_SET` followed by `fgetc` to test whether there is a byte there. One can test larger and larger offsets (1, 2, 4, 8, 16,…) to find an upper limit, then do a binary search between the last offset that succeeds and the first that fails. (The `fgetc` is needed because `fseek` succeeds in setting the position regardless of whether there is currently a byte available at the position.) – Eric Postpischil Apr 24 '21 at 16:57
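
For reference, the binary search described in the last comment could look roughly like the sketch below. The names `has_byte_at` and `fsize_bsearch` are made up for illustration, and this is not the code that was tested on macOS. It assumes a seekable binary stream, and since `fseek()` takes a `long` offset it gives up (returns -1) on files too large for that, which on platforms with a 32-bit `long` means roughly 2 GiB and up. A failed `fseek()` is treated as "past the end", which is also an assumption.

```c
#include <stdio.h>
#include <limits.h>

/* Returns 1 if a byte can be read at offset off, 0 if not, -1 on read error. */
static int has_byte_at(FILE *f, long off)
{
    if (fseek(f, off, SEEK_SET) != 0)
        return 0;                   /* assumption: unseekable offset means past the end */
    if (fgetc(f) != EOF)
        return 1;
    return ferror(f) ? -1 : 0;
}

/* Hypothetical helper: file size via SEEK_SET probes, or -1 on failure. */
long fsize_bsearch(const char *filename)
{
    long lo = 0, hi = 1, result = -1;
    int r;
    FILE *f = fopen(filename, "rb");
    if (!f)
        return -1;
    if (fseek(f, 0, SEEK_SET) != 0)     /* stream is not seekable at all */
        goto done;

    r = has_byte_at(f, 0);
    if (r <= 0) {                       /* empty file or read error */
        result = (r == 0) ? 0 : -1;
        goto done;
    }

    /* Exponential probe: offsets 1, 2, 4, ... until one has no byte.
       Invariant: there is a byte at lo. */
    while ((r = has_byte_at(f, hi)) == 1) {
        lo = hi;
        if (hi > LONG_MAX / 2)
            goto done;                  /* size does not fit the long offsets */
        hi *= 2;
    }
    if (r < 0)
        goto done;

    /* Binary search: byte at lo, no byte at hi. */
    while (hi - lo > 1) {
        long mid = lo + (hi - lo) / 2;
        r = has_byte_at(f, mid);
        if (r < 0)
            goto done;
        if (r)
            lo = mid;
        else
            hi = mid;
    }
    result = hi;                        /* first offset without a byte == size */
done:
    fclose(f);
    return result;
}
```

This needs only a logarithmic number of seeks and one-byte reads instead of reading the whole file, but like every approach here it only reports what the size was at the time of the probes.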