29

I'm dealing with small text files that i want to read into a buffer while i process them, so i've come up with the following code:

...
char source[1000000];

FILE *fp = fopen("TheFile.txt", "r");
if(fp != NULL)
{
    while((symbol = getc(fp)) != EOF)
    {
        strcat(source, &symbol);
    }
    fclose(fp);
}
...

Is this the correct way of putting the contents of the file into the buffer or am i abusing strcat()?

I then iterate through the buffer thus:

for(int x = 0; (c = source[x]) != '\0'; x++)
{
    //Process chars
}
Gary Willoughby
  • 50,926
  • 41
  • 133
  • 199
  • 1
    This is wrong. `strcat` concatenates strings. Even if `&symbol` is a `char *`, it is not null-terminated. You should use `fgets` or `fread`. Also, `strcat` is going to be slow in your case anyway because it scans `source` every time it needs to append one character. – Alok Singhal Jan 08 '10 at 16:50
  • Not to mention reading a char at a time will be much slower than using `fread`. – Nick Meyer Jan 08 '10 at 16:57
  • @Nick: I don't know about 'much slower': because of io buffering and possibly inlining of the function calls, the performance impact needs not necessarily be that large; using `fread()` is still a good idea, though – Christoph Jan 08 '10 at 17:07
  • btw: I wouldn't consider a 1m text file 'small' ;) – Christoph Jan 08 '10 at 17:09
  • Look into mmap() to memory map the file. Look out for buffer overflows. Do not use strcat() - even if you fix the issues with null termination, it gives you quadratic behaviour which is bad on megabyte files and a disaster on gigabyte files. – Jonathan Leffler Jan 08 '10 at 17:45
  • @Alok: `&symbol` is null terminated if `symbol` is an int, and you're on a little-endian architecture. Not something I'd want to rely on though. – Mark Ransom Jan 08 '10 at 18:21
  • @Mark: what if `sizeof(int) == 1`? As you said, it's best to not rely on it. – Alok Singhal Jan 08 '10 at 18:51

8 Answers8

88
char source[1000000];

FILE *fp = fopen("TheFile.txt", "r");
if(fp != NULL)
{
    while((symbol = getc(fp)) != EOF)
    {
        strcat(source, &symbol);
    }
    fclose(fp);
}

There are quite a few things wrong with this code:

  1. It is very slow (you are extracting the buffer one character at a time).
  2. If the filesize is over sizeof(source), this is prone to buffer overflows.
  3. Really, when you look at it more closely, this code should not work at all. As stated in the man pages:

The strcat() function appends a copy of the null-terminated string s2 to the end of the null-terminated string s1, then add a terminating `\0'.

You are appending a character (not a NUL-terminated string!) to a string that may or may not be NUL-terminated. The only time I can imagine this working according to the man-page description is if every character in the file is NUL-terminated, in which case this would be rather pointless. So yes, this is most definitely a terrible abuse of strcat().

The following are two alternatives to consider using instead.

If you know the maximum buffer size ahead of time:

#include <stdio.h>
#define MAXBUFLEN 1000000

char source[MAXBUFLEN + 1];
FILE *fp = fopen("foo.txt", "r");
if (fp != NULL) {
    size_t newLen = fread(source, sizeof(char), MAXBUFLEN, fp);
    if ( ferror( fp ) != 0 ) {
        fputs("Error reading file", stderr);
    } else {
        source[newLen++] = '\0'; /* Just to be safe. */
    }

    fclose(fp);
}

Or, if you do not:

#include <stdio.h>
#include <stdlib.h>

char *source = NULL;
FILE *fp = fopen("foo.txt", "r");
if (fp != NULL) {
    /* Go to the end of the file. */
    if (fseek(fp, 0L, SEEK_END) == 0) {
        /* Get the size of the file. */
        long bufsize = ftell(fp);
        if (bufsize == -1) { /* Error */ }

        /* Allocate our buffer to that size. */
        source = malloc(sizeof(char) * (bufsize + 1));

        /* Go back to the start of the file. */
        if (fseek(fp, 0L, SEEK_SET) != 0) { /* Error */ }

        /* Read the entire file into memory. */
        size_t newLen = fread(source, sizeof(char), bufsize, fp);
        if ( ferror( fp ) != 0 ) {
            fputs("Error reading file", stderr);
        } else {
            source[newLen++] = '\0'; /* Just to be safe. */
        }
    }
    fclose(fp);
}

free(source); /* Don't forget to call free() later! */
Michael
  • 11,612
  • 10
  • 41
  • 43
  • @Michael: getch returns an int, presumably symbol is a local variable that is also an int. As long as the character doesn't get sign-extended into a negative number and it's not EOF, it will be less than 256. On a little-endian architecture such as x86 this will be null terminated always, as the high bytes follow the low byte. Accidentally, of course. – Mark Ransom Jan 08 '10 at 18:31
  • Also, some compilers will zero fill uninitialized arrays. The O.P. must have really gotten lucky. – Mark Ransom Jan 08 '10 at 18:32
  • 1
    @Mark: you're right, and I'm sure you know this, but `sizeof(int)` can be 1. @Michael, let's say I read a character 'A' in `symbol`. Then, `&symbol` (even if `symbol` is `char`) is a pointer pointing to `'A'` followed by random data. If `symbol` is an `int`, and `sizeof(int) > 1`, then `&symbol`, when converted to `char *`, points to `'A'` followed by `0` or `0`, depending upon the endianness of the machine. – Alok Singhal Jan 08 '10 at 18:49
  • 1
    @Alok, Mark: I forgot getc() returns an int, not a char. Wow, that is very subtle. – Michael Jan 08 '10 at 21:55
  • 1
    http://stackoverflow.com/a/238607/309483: "If you use ftell, then you must open the file in binary mode. If you open it in text mode, ftell only returns a "cookie" that is only usable by fseek." – Janus Troelsen Oct 18 '12 at 22:16
  • I think your error checking for `fseek` is backwards. Should be `if (fseek(...) != 0) {...error...}` (the first usage is correct, but the second one is backwards) http://www.cplusplus.com/reference/cstdio/fseek/ – crazy2be Nov 01 '13 at 21:03
  • 1
    @Michael - Using `calloc` instead of malloc would mean that you wouldn't have to put `source[++newLen] = '\0';` – Joseph Feb 21 '14 at 10:08
  • Note that `fseek()` does not work on pipes or terminals (or some other devices, but you're less likely to encounter them). So for those file types, the last lot of code won't work well. It's feasible to write code that incrementally allocates the array as you read the data. Double the buffer size each time you need more space; you can free up the excess at the end if you're worried about it (maybe you read 1.1 MiB into a 2.0 MiB buffer and want to hand back the unused 0.9 MiB - that's reasonable). – Jonathan Leffler Apr 09 '15 at 01:19
6

Yes - you would probably be arrested for your terriable abuse of strcat !

Take a look at getline() it reads the data a line at a time but importantly it can limit the number of characters you read, so you don't overflow the buffer.

Strcat is relatively slow because it has to search the entire string for the end on every character insertion. You would normally keep a pointer to the current end of the string storage and pass that to getline as the position to read the next line into.

Martin Beckett
  • 94,801
  • 28
  • 188
  • 263
6

If you're on a linux system, once you have the file descriptor you can get a lot of information about the file using fstat()

http://linux.die.net/man/2/stat

so you might have

#include  <unistd.h> 
void main()
{
    struct stat stat;
    int fd;
    //get file descriptor
    fstat(fd, &stat);
    //the size of the file is now in stat.st_size
}

This avoids seeking to the beginning and end of the file.

toweleeele
  • 173
  • 2
  • 8
1

See this article from JoelOnSoftware for why you don't want to use strcat.

Look at fread for an alternative. Use it with 1 for the size when you're reading bytes or characters.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
1

Why don't you just use the array of chars you have? This ought to do it:

   source[i] = getc(fp); 
   i++;
Martin Wickman
  • 19,662
  • 12
  • 82
  • 106
1

Not tested, but should work.. And yes, it could be better implemented with fread, I'll leave that as an exercise to the reader.

#define DEFAULT_SIZE 100
#define STEP_SIZE 100

char *buffer[DEFAULT_SIZE];
size_t buffer_sz=DEFAULT_SIZE;
size_t i=0;
while(!feof(fp)){
  buffer[i]=fgetc(fp);
  i++;
  if(i>=buffer_sz){
    buffer_sz+=STEP_SIZE;
    void *tmp=buffer;
    buffer=realloc(buffer,buffer_sz);
    if(buffer==null){ free(tmp); exit(1);} //ensure we don't have a memory leak
  }
}
buffer[i]=0;
Earlz
  • 62,085
  • 98
  • 303
  • 499
  • wouldn't `realloc` be slow? – ajay Jan 13 '14 at 06:18
  • Sorta, but you really need to worry about `char *buffer[DEFAULT_SIZE]` because it is an array of pointers, not of characters. The assignment to `buffer[i]` is dubious at best; `fgetc()` returns a `char`, not a `char *`. If we pretend that it is `char *buffer = 0;`, you're almost there. You need to read the character into an `int`, and only stash it in the array when you're sure it is not EOF and there is enough space. [`while (!feof(file))` is always wrong](http://stackoverflow.com/questions/5431941/)! This answer needs considerable work (but is the basis for what might be a good answer). – Jonathan Leffler Apr 09 '15 at 01:16
0

Methinks you want fread:

http://www.cplusplus.com/reference/clibrary/cstdio/fread/

wprl
  • 24,489
  • 11
  • 55
  • 70
-2

Have you considered mmap()? You can read from the file directly as if it were already in memory.

http://beej.us/guide/bgipc/output/html/multipage/mmap.html

Ioan
  • 2,382
  • 18
  • 32