
This is a follow-up to Conversion of string.endswith method into C, where I'm (trying to) convert a Python program into C.

The next part is to properly read data from a file into a buffer and check various places it may error. Here is what I have thus far:

#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE (1024*10)
char buffer[BUFFER_SIZE];

void exit_with_error(const char* msg)
{
    fprintf(stderr, "%s", msg);
    exit(EXIT_FAILURE);
}

int main(int argc, char* argv[])
{
     // with open(argv[-1]) as f:
    //    contents = f.read()

    // 1. Open file
    FILE *fp = fopen(argv[argc-1], "r");
    if (fp == NULL) exit_with_error("Error reading file");

    // 2. Read and confirm BUFFER is OK
    fseek(fp, 0L, SEEK_END);
    long fsize = ftell(fp);
    if (fsize > BUFFER_SIZE) exit_with_error("File is larger than buffer.");

    // 3. Write to buffer and close.
    rewind(fp);
    size_t read_size = fread(buffer, 1, fsize, fp);
    if (read_size != (size_t)fsize) exit_with_error("There was an error reading the file.");
    fclose(fp);

}

I have a few questions on the above:

  • For reading a file into a buffer, is it more common to have a standard buffer size and write into that, or to grab the size and then do a malloc? Is there an advantage of one way over the other?
  • Is it necessary to do all these error checks in the main method, or can I just assume that the user provides the correct file (and that it's readable, etc.)?
  • Finally, why do some of the file methods return size_t and others return long (such as ftell)? Is it safe to use size_t for all of them, or are there reasons why some are not that type?
David542
  • 1. Not really answerable. It depends on the context and requirements. Sometimes allocating a fixed buffer and reading in chunks is the right thing to do. Sometimes getting the file size and allocating the full buffer is the right thing. 2. Yes it is necessary. Relying on unvalidated user input or not checking function return values is a recipe for disaster. – kaylum Apr 08 '21 at 21:26
  • As for question 3: `size_t` depends on your compiler version and is defined either as `unsigned int` for 32bit compilers or as `unsigned long long` for 64bit compilers. It should be best to follow the function's declaration when choosing which type to use. As for `ftell`, one of the error return values is `-1` which does not match `unsigned` types. – Irad Ohayon Apr 08 '21 at 21:31
  • @IradOhayon It can also be `unsigned long`. – Siguza Apr 08 '21 at 21:34
  • If `fopen` fails, it is misleading to give an error message that says "Error reading file". There was no error reading the file. Let the system give you a good error message: `FILE *fp = fopen(argv[argc-1], "r"); if (fp == NULL) { perror(argv[argc - 1]); exit(EXIT_FAILURE); }` (or use a wrapper and have the wrapper print `strerror(errno)`) – William Pursell Apr 08 '21 at 21:44
  • @WilliamPursell thanks for that tip. I ended up doing: `exit_with_error("Error on '%s': %s\n", last_arg, strerror(errno));` and then I made the `exit_with_error` accept variable args. – David542 Apr 09 '21 at 00:18
  • The `fseek`/`ftell` combo is not appropriate to get the read size of the file. Not every file is seekable, the file size may change between `fseek` and reading, and some weird OS (aka Windows) may do file translation on reading. The correct way to put the whole file in a buffer is to just read the file in a loop, writing to the buffer as you go, until EOF, and resizing the buffer as appropriate. Even better is to not have a whole-file buffer, but process the data as you read. – HAL9000 Apr 09 '21 at 00:31

1 Answer


For reading a file into a buffer, is it more common to have a standard buffer size and write that into the buffer, or to grab the size and then do a malloc. Is there an advantage of one way over the other?

"Grab the size" is tricky. long fsize = ftell(fp); often works, but the C standard does not specify that the value after seeking to the end is the file's length (for a text stream, the value is only guaranteed to be usable as an argument to fseek()). Highly portable code does not use ftell() to find a file's size.

To read the entire file, calling fread() in a loop is the way to go.
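Such a loop might look like the following sketch (the helper name `read_all` and the 4096-byte starting capacity are my own choices):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream into a malloc'd buffer, growing as needed.
   On success, returns the buffer and stores the byte count in *out_size.
   Returns NULL on allocation or read error. */
static char *read_all(FILE *fp, size_t *out_size) {
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;

    for (;;) {
        len += fread(buf + len, 1, cap - len, fp);
        if (len < cap) break;               /* short read: EOF or error */
        char *tmp = realloc(buf, cap * 2);  /* buffer full: double it */
        if (tmp == NULL) { free(buf); return NULL; }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) { free(buf); return NULL; }
    *out_size = len;
    return buf;
}
```

No fseek()/ftell() is needed, so this also works on pipes and other non-seekable streams.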

Is it necessary to do all these error-checks in the main method, or can I just assume that the user provides the correct file (and it's readable, etc.)

Robust code avoids making assumptions. Better to assume user input is evil (watch the Potassium video) and perform lots of error checking.
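For instance, the very first assumption in the posted code, that argv[argc-1] names a readable file, already needs two checks. A sketch of one way to do it (the helper name `open_checked` is my own):

```c
#include <stdio.h>
#include <stdlib.h>

/* Validate the command line and open the last argument, or return NULL.
   perror() reports *why* fopen failed (missing file, permissions, ...). */
static FILE *open_checked(int argc, char *argv[]) {
    if (argc < 2) {                       /* no filename argument at all */
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return NULL;
    }
    FILE *fp = fopen(argv[argc - 1], "rb");
    if (fp == NULL) perror(argv[argc - 1]);
    return fp;
}
```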

Finally, why do some of the file methods return size_t and other return long (such as doing ftell?) Is it safe to use size_t for all of them or are there reason why some are not that type?

The range of size_t is typically driven by the memory size and architecture of the machine. The range of file sizes is a file-system limitation. It is not uncommon for a file's size to exceed the memory size. When C first came out, long was the largest signed type and was adequate for all cases until, say, about 1995. Now files can exceed 2G (the minimum maximum value for long). Even with a 32-bit long, huge file offsets can be tracked via fgetpos(). Many newer systems employ a 64-bit long, allowing even more range via ftell().
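To illustrate the fgetpos() point: fpos_t is an opaque type that can represent positions beyond LONG_MAX, but it only round-trips through fsetpos() rather than supporting arithmetic. A small sketch (the helper name `peek_byte` is my own):

```c
#include <stdio.h>

/* Look at the next byte without consuming it, via fgetpos()/fsetpos().
   Returns the byte, or EOF on error or end of file. */
static int peek_byte(FILE *fp) {
    fpos_t pos;
    if (fgetpos(fp, &pos)) return EOF;  /* save the current position */
    int c = fgetc(fp);
    if (fsetpos(fp, &pos)) return EOF;  /* restore it */
    return c;
}
```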


If code attempts to find the file size with ftell() ....

Check ftell() and other I/O function results.

if (fseek(fp, 0, SEEK_END)) {
  exit_with_error("fseek failure.");
}
long fsize = ftell(fp);
if (fsize == -1) {
  exit_with_error("ftell failure.");
}

// In many cases, files exceeding `SIZE_MAX` are also problematic.
// Yet rarely is LONG_MAX > SIZE_MAX. (SIZE_MAX is in <stdint.h>.)
if (fsize > SIZE_MAX) {
  exit_with_error("fsize very large.");
}

Often it is more desirable to open the file in binary mode, e.g. fopen(name, "rb"), so no newline translation occurs and the byte counts are meaningful.


Lastly, IMO, when handling files, avoid the need to read them entirely into memory and instead cope with the data in chunks.
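As an example of that chunked style, here is a line counter that never holds more than one fixed-size chunk in memory (a sketch; the name `count_lines` and the 4096-byte chunk size are my own choices):

```c
#include <stdio.h>

/* Count '\n' bytes using a small fixed buffer; no whole-file allocation. */
static long count_lines(FILE *fp) {
    char chunk[4096];
    size_t n;
    long lines = 0;
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (chunk[i] == '\n')
                lines++;
    }
    return ferror(fp) ? -1 : lines;
}
```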

chux - Reinstate Monica
  • Even for smaller files, and even in Python, it is preferable to avoid reading the whole file into memory at once if that can be avoided. – John Bollinger Apr 09 '21 at 01:42