109

I want to read the full contents of a file into a buffer. The file contains only a string, which I need to compare with another string.

What would be the most efficient option that is also portable to Linux?

ENV: Windows

Sunny

3 Answers

209

Portability between Linux and Windows is a big headache, since Linux is a POSIX-conformant system with - generally - a proper, high quality toolchain for C, whereas Windows doesn't even provide a lot of functions in the C standard library.

However, if you want to stick to the standard, you can write something like this:

#include <stdio.h>
#include <stdlib.h>

FILE *f = fopen("textfile.txt", "rb");
fseek(f, 0, SEEK_END);
long fsize = ftell(f);
fseek(f, 0, SEEK_SET);  /* same as rewind(f); */

char *string = malloc(fsize + 1);
fread(string, fsize, 1, f);
fclose(f);

string[fsize] = 0;

Here string will contain the contents of the text file as a properly 0-terminated C string. This code is just standard C; it's not POSIX-specific (although that doesn't guarantee it will work or compile on Windows...)
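For readers who want the error checks that the snippet above leaves out, here is a minimal sketch of the same seek-based approach wrapped in a function, followed by the string comparison the question asks about. The helper names `read_text_file` and `file_equals` are made up for this illustration. The sketch assumes a regular, seekable file whose size fits in a `long`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc()ed, 0-terminated copy of the
   file's contents, or NULL on any failure. The caller must free() it. */
char *read_text_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    char *string;
    long fsize;
    size_t nread;

    if (f == NULL)
        return NULL;

    /* Determine the size by seeking; this only works on seekable files. */
    if (fseek(f, 0, SEEK_END) != 0 ||
        (fsize = ftell(f)) < 0 ||
        fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    string = malloc((size_t)fsize + 1);
    if (string == NULL) {
        fclose(f);
        return NULL;
    }

    /* Element size 1, so the return value is the number of bytes read. */
    nread = fread(string, 1, (size_t)fsize, f);
    if (ferror(f)) {
        free(string);
        fclose(f);
        return NULL;
    }
    fclose(f);

    /* Terminate after the bytes actually read. */
    string[nread] = '\0';
    return string;
}

/* Hypothetical helper: 1 if the file holds exactly `expected`, else 0
   (also 0 if the file could not be read). */
int file_equals(const char *path, const char *expected)
{
    char *contents = read_text_file(path);
    int same;

    if (contents == NULL)
        return 0;
    same = strcmp(contents, expected) == 0;
    free(contents);
    return same;
}
```

Note that `strcmp()` stops at the first 0 byte, so this comparison is only meaningful for text without embedded NULs. A trailing newline in the file, which many editors add, will also make the comparison fail unless it is stripped first.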

  • 19
    Just in case any visitors are wondering, `rewind(f);` is equivalent to `fseek(f, 0, SEEK_SET);` and could be used here instead. Both are part of `<stdio.h>`. – lynks Apr 25 '13 at 11:45
  • 11
    Oh, and before points it out: one must always check the return value of `malloc()` and `fread()`. Here, error checking is omitted only for simplicity - do not copypasta this code verbatim into a production code base. –  Sep 26 '13 at 10:23
  • 2
    Always, [ALWAYS](http://stackoverflow.com/questions/19260209/ftell-returning-incorrect-value) open file with mode `rb` rather than `r`. – Engineer Nov 26 '13 at 19:49
  • @NickWiggill Fair enough (I needed to convert some of my code accordingly a few weeks ago). However, f**k Windows. –  Nov 26 '13 at 19:50
  • 14
    Don't forget to `free()` the `string`. – Tomáš Zato May 13 '14 at 12:08
  • If the file is very large, won't a buffer as big as the whole file cause performance overhead? – Harshit Gupta Dec 12 '14 at 22:55
  • 1
    Care to substantiate the claim that "Windows doesn't provide a lot of functions in the C standard library"? Last time I checked, it was the compiler's job to provide an implementation of the standard library, and every Windows-compatible compiler I know of does. That seems like a pretty big oversight for an accepted answer with 44 upvotes. – Stuntddude Dec 27 '15 at 23:46
  • 18
    Actually, this solution does _not_ stick to the standard; the standard stipulates that "a binary stream need not meaningfully support `fseek` calls with a whence value of `SEEK_END`", and that "setting the file position indicator to end-of-file, as with `fseek(file, 0, SEEK_END)`, has undefined behavior for a binary stream". – Ori Mar 05 '16 at 00:46
  • 7
    Perhaps fsize = fread(string, 1, fsize, f); would be better - in case it's not wholly read. – android.weasel Apr 26 '16 at 14:23
  • 2
    A couple of points. 1/ Using `fseek()` or `rewind()` means you can only read disk files. You will not be able to read from standard input, named pipes, devices, or network streams. 2/ Beware of binary files, whether they turn up accidentally or maliciously. – anthony Jan 05 '17 at 05:01
  • call to malloc is failing with this error: `a.out: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.` – Amninder Singh Nov 10 '17 at 15:41
  • Should this be string[fsize+1] = 0; ? – Kumar Roshan Mehta Feb 09 '18 at 09:02
  • @RoshanMehta If the file contained "George", `fsize` would be `6`. We add 1 in the `malloc` call **to make room for the null terminator (which is `0`)**. Array indices start at `0`, so `string[0]` is `G`, `string[1]` is `e`, and so on up to `string[5]`, the last character. `string[fsize + 1] = 0;` is not correct, because it would write to `string[7]`, **memory we have not allocated**. `string[6]` is the last element of the allocated array, so that is where the terminating `0` goes. – Jack Murrow Oct 25 '20 at 16:54
  • Don't forget the `string[fsize] = 0;`. The contents of memory returned by `malloc` are indeterminate, so the byte after the data may not be 0, which would cause a buffer over-read when the buffer is used as a normal 0-terminated string. – Blanket Fox Sep 19 '21 at 09:17
  • Add explanation: `string[fsize] = 0` means assign '\0' at the end of the buffer. – Tan Nguyen Feb 06 '22 at 15:03
43

Here is what I would recommend.

It should conform to C89, and be completely portable. In particular, it works also on pipes and sockets on POSIXy systems.

The idea is that we read the input in large-ish chunks (READALL_CHUNK), dynamically reallocating the buffer as we need it. We only use realloc(), fread(), ferror(), and free():

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

/* Size of each input chunk to be
   read and allocate for. */
#ifndef  READALL_CHUNK
#define  READALL_CHUNK  262144
#endif

#define  READALL_OK          0  /* Success */
#define  READALL_INVALID    -1  /* Invalid parameters */
#define  READALL_ERROR      -2  /* Stream error */
#define  READALL_TOOMUCH    -3  /* Too much input */
#define  READALL_NOMEM      -4  /* Out of memory */

/* This function returns one of the READALL_ constants above.
   If the return value is zero == READALL_OK, then:
     (*dataptr) points to a dynamically allocated buffer, with
     (*sizeptr) chars read from the file.
     The buffer is allocated for one extra char, which is NUL,
     and automatically appended after the data.
   Initial values of (*dataptr) and (*sizeptr) are ignored.
*/
int readall(FILE *in, char **dataptr, size_t *sizeptr)
{
    char  *data = NULL, *temp;
    size_t size = 0;
    size_t used = 0;
    size_t n;

    /* None of the parameters can be NULL. */
    if (in == NULL || dataptr == NULL || sizeptr == NULL)
        return READALL_INVALID;

    /* A read error already occurred? */
    if (ferror(in))
        return READALL_ERROR;

    while (1) {

        if (used + READALL_CHUNK + 1 > size) {
            size = used + READALL_CHUNK + 1;

            /* Overflow check. size_t arithmetic wraps around,
               so a wrapped size is not larger than used. */
            if (size <= used) {
                free(data);
                return READALL_TOOMUCH;
            }

            temp = realloc(data, size);
            if (temp == NULL) {
                free(data);
                return READALL_NOMEM;
            }
            data = temp;
        }

        n = fread(data + used, 1, READALL_CHUNK, in);
        if (n == 0)
            break;

        used += n;
    }

    if (ferror(in)) {
        free(data);
        return READALL_ERROR;
    }

    temp = realloc(data, used + 1);
    if (temp == NULL) {
        free(data);
        return READALL_NOMEM;
    }
    data = temp;
    data[used] = '\0';

    *dataptr = data;
    *sizeptr = used;

    return READALL_OK;
}

Above, I've used a constant chunk size, READALL_CHUNK == 262144 (256*1024). This means that in the worst case, up to 262145 chars are wasted (allocated but not used), but only temporarily. At the end, the function reallocates the buffer to the optimal size. Also, this means that we do four reallocations per megabyte of data read.

The 262144-byte default in the code above is a conservative value; it works well for even old minilaptops and Raspberry Pis and most embedded devices with at least a few megabytes of RAM available for the process. Yet, it is not so small that it slows down the operation (due to many read calls, and many buffer reallocations) on most systems.

For desktop machines at this time (2017), I recommend a much larger READALL_CHUNK, perhaps #define READALL_CHUNK 2097152 (2 MiB).

Because the definition of READALL_CHUNK is guarded (i.e., it is defined only if it is at that point in the code still undefined), you can override the default value at compile time, by using (in most C compilers) -DREADALL_CHUNK=2097152 command-line option -- but do check your compiler options for defining a preprocessor macro using command-line options.
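As a usage illustration, the sketch below reads a stream in chunks and compares the result with an expected string, which is what the original question asks for. So that it compiles on its own, it uses `read_stream()`, a condensed stand-in for `readall()` with a single failure code and no overflow check. That helper, the `stream_equals()` wrapper, and the `SKETCH_CHUNK` value are assumptions made for this example; in real code you would call `readall()` from above instead.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_CHUNK 4096  /* assumed chunk size, for this sketch only */

/* Condensed stand-in for readall(): returns 0 on success, -1 on failure.
   On success, (*dataptr) is a 0-terminated buffer of (*sizeptr) chars. */
int read_stream(FILE *in, char **dataptr, size_t *sizeptr)
{
    char *data = NULL, *temp;
    size_t size = 0, used = 0, n;

    while (1) {
        /* Make sure there is room for one more chunk plus the NUL. */
        if (used + SKETCH_CHUNK + 1 > size) {
            size = used + SKETCH_CHUNK + 1;
            temp = realloc(data, size);
            if (temp == NULL) {
                free(data);
                return -1;
            }
            data = temp;
        }
        n = fread(data + used, 1, SKETCH_CHUNK, in);
        if (n == 0)
            break;
        used += n;
    }

    if (ferror(in)) {
        free(data);
        return -1;
    }

    data[used] = '\0';
    *dataptr = data;
    *sizeptr = used;
    return 0;
}

/* Hypothetical wrapper: 1 if the stream holds exactly `expected`,
   0 if it differs, -1 if reading failed. */
int stream_equals(FILE *in, const char *expected)
{
    char *data;
    size_t size;
    int same;

    if (read_stream(in, &data, &size) != 0)
        return -1;

    /* Compare lengths first, so embedded NUL bytes cannot hide a mismatch. */
    same = size == strlen(expected) && memcmp(data, expected, size) == 0;
    free(data);
    return same;
}
```

Because the reader reports the number of chars read, the comparison can use `memcmp()` with an explicit length rather than relying on the terminating NUL alone.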

Nominal Animal
  • 9
    Upvote for not seeking the file – Elviss Strazdins Apr 25 '18 at 14:59
  • 4
    @ElvissStrazdins: Thanks; that's exactly why I mentioned this one works for pipes and sockets -- they're not seekable at all. Do you have an opinion whether I should add a paragraph about how the seek approach does not work on those? (Neither does the `fstat()` approach, by the way.) Reading the stream until read fails is really the only portable option that works on everything you can get a `FILE` handle on. I'd prefer new C programmers to know that before it bites them in the ankle, you see. – Nominal Animal Aug 13 '18 at 14:30
  • 2
    Not only pipes and sockets are the problem, but getting the file size with fstat and then reading the file creates a race condition in case the file is being modified externally (adding or erasing data from it in another process). – Elviss Strazdins Aug 14 '18 at 00:57
  • 5
    @ElvissStrazdins: Very true. Yet, just about all the answers to this and similar questions use the seek method. Similarly, one should use `nftw()`/`fts_..()`/`glob()`/`wordexp()` rather than `opendir()`/`readdir()`/`closedir()`, to easily handle files/directories being added/deleted/renamed during traversal. I know I should not care, but I really don't like the idea of more C programmers writing code that works only in specific circumstances, and silently fails - or worse yet, destroys data - otherwise. The world is already full of such code, and we need less of it, not more. – Nominal Animal Aug 14 '18 at 01:31
  • Great solution. However, I'd grow the read amount exponentially to reduce the number of reallocations from O(n) to O(log(n)), thus increasing performance for nearly all file sizes. – Andreas Sep 13 '18 at 13:09
  • 3
    @Andreas: The overhead of a realloc() and a read() syscall is insignificant for larger chunk sizes (2 MiB or larger on currently typical desktop machines), so the operation is I/O bound, and time complexity is irrelevant; the time taken is essentially a linear function of the (large) file size. It is better to limit the amount of allocated but unused memory instead. – Nominal Animal Sep 13 '18 at 16:10
  • 1
    Most of the time the last fread call will be useless, and you could test `n < READALL_CHUNK` instead (although this probably has little to no performance impact). This is mentioned in [this QA](https://stackoverflow.com/questions/39322123/is-it-correct-to-check-if-the-number-of-items-read-is-less-than-requested-rathe/39322170#39322170). Am I right? – Gabriel Devillers Aug 28 '19 at 20:43
  • @GabrielDevillers is right. Most of the time the last fread would be useless. More importantly, this implementation would needlessly realloc `READALL_CHUNK` bytes (in the next loop iteration) and immediately undo that with a subsequent realloc (outside the loop). This is especially wasteful if the total file size is less than `READALL_CHUNK` bytes. – sidcha Jul 30 '20 at 11:42
  • @NominalAnimal Can you name a real-world case where this actually happens? I always see people posting these "concerns" about things that quite literally never happen in the real world, without providing any evidence. Most software I use simply checks whether the file is locked (advisory lock) and locks it itself if it's something actually "important". If I see software not doing that, it's leaving my system really soon :) – Kaihaku Sep 22 '21 at 16:56
  • 1
    This discussion is interesting and potentially valuable. Can anybody name a manual, textbook or resource where all this sort of thing is explained? – Georgina Davenport Nov 26 '21 at 06:07
-2

A portable solution could use getc.

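#include <stdio.h>

/* Assumes fp is an open FILE * and MAX_FILE_SIZE is defined. */
char buffer[MAX_FILE_SIZE];
size_t i;

/* Stop one short of the buffer size to leave room for the terminator. */
for (i = 0; i < MAX_FILE_SIZE - 1; ++i)
{
    int c = getc(fp);

    if (c == EOF)
        break;

    buffer[i] = (char)c;
}

buffer[i] = 0x00;  /* terminate on EOF or when the buffer is full */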

If you don't want to have a MAX_FILE_SIZE macro or if it is a big number (such that buffer would be too big to fit on the stack), use dynamic allocation.
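A minimal sketch of that dynamic-allocation variant might look like the following. The function name `read_with_getc`, the starting capacity of 64, and the doubling growth strategy are all assumptions made for this illustration, not part of the answer above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads fp to EOF with getc() into a growing heap
   buffer. Returns a 0-terminated string that the caller must free(),
   or NULL on allocation failure or read error. */
char *read_with_getc(FILE *fp)
{
    size_t capacity = 64;   /* assumed starting size */
    size_t length = 0;
    char *buffer = malloc(capacity);
    int c;

    if (buffer == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF) {
        /* Grow when only the terminator's slot would be left. */
        if (length + 1 >= capacity) {
            char *temp;

            /* Refuse to double if the new capacity would wrap around. */
            if (capacity > (size_t)-1 / 2) {
                free(buffer);
                return NULL;
            }
            capacity *= 2;
            temp = realloc(buffer, capacity);
            if (temp == NULL) {
                free(buffer);
                return NULL;
            }
            buffer = temp;
        }
        buffer[length++] = (char)c;
    }

    /* EOF can also mean a read error; tell the two apart. */
    if (ferror(fp)) {
        free(buffer);
        return NULL;
    }

    buffer[length] = '\0';
    return buffer;
}
```

Doubling keeps the number of `realloc()` calls small even for large inputs, at the cost of up to half the buffer being unused when reading finishes.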

md5
  • 6
    Better allocate huge memory chunks on the heap. Also, please don't read byte-by-byte, I'm sure that in any decent implementation of libc, the `fread()` function provides something more efficient. –  Dec 22 '12 at 13:21
  • 1
    @H2CO3: I completely agree that reading byte by byte is inefficient; it was just to provide a standard and very easy solution (`fgets` could also do the trick). Also, I don't like to use POSIX functions such as `fread` on Windows, because the POSIX implementation on this operating system often differs from the specification. As for heap allocation, it is mentioned at the end of my answer. – md5 Dec 22 '12 at 13:25
  • 5
    `fread()` is not POSIX-specific. If you don't like to use it, you may abandon `fgetc()` as well. –  Dec 22 '12 at 13:29