2

What is the best way to get the contents of a file into a single character array?

I have read this question:

Easiest way to get file's contents in C

But from the comments, I've seen that the solution isn't great for large files. I do have access to the stat function. If the file size is over 4 gb, should I just return an error?

The contents of the file are encrypted, and since the file is supplied by the user it could be as large as anyone wants it to be. I want the code to return an error, not crash, if the file is too big. The main purpose of populating the character array with the contents of the file is to compare it to another character array and also (if needed and configured to do so) to write both to a log file (or multiple log files if necessary).

  • 4
    If you want to compare the contents of a file to a character array, there's no need to read the entire file into memory. Just iterate through the file (reading say 4096 bytes at a time), checking each byte against the appropriate member in the array. – William Pursell Jan 03 '13 at 19:50
  • Good point, but I will need to write them to a log afterwards, if the user wants to. Perhaps just one process to compare, and then another to write them to the log file? – SSH This Jan 03 '13 at 19:51
  • The answer depends on the size and type of file data. – Jonathan Wood Jan 03 '13 at 19:52
  • The question you linked already answers the question *"What is the best way to get the contents of a file into a single character array?"*. As I understand it, you want to know what is the most efficient way to determine the size of the file... is that it? – netcoder Jan 03 '13 at 19:53
  • "fseek will fail on files >4GB" so the solution would be to get the file size, if it's more than 4gb, then just return an error? – SSH This Jan 03 '13 at 19:55
  • @SSHThis: `fseek` will not necessarily fail for files larger than 4GB. It depends on your platform. The alternative to `fseek` may also depend on your platform. If you can use `stat`, just use `stat`. Not sure what your question is, really. – netcoder Jan 03 '13 at 19:57
  • This code needs to work on Windows XP+, AIX and Linux, I have previously used constants to split up the code that was required for a specific platform. – SSH This Jan 03 '13 at 19:59
  • Okay well, just adapt the code from the other question to use `stat` to determine `length` instead of `fseek` / `ftell` and you're done... – netcoder Jan 03 '13 at 20:01
  • Well, if the device has enough memory, I believe you can read files larger than 4GB by reading e.g. 4096 bytes at a time and storing them in your character array. Of course, you have to manage the memory and make sure there is enough space before each copy. This method does one malloc() and probably several realloc() calls (you can use a chunk size larger than 4096 if you want), which makes the program a bit slower than Nils Pipenbrinck's implementation (in the question you linked), but it will work. – Jack Jan 03 '13 at 20:01
  • Are you doing some kind of computation on the contents of the file, or just comparing them to other contents? If the latter, you can use the process I described above and apply a hash function to get an integer value, then free the current file contents to make room for the second file. Do the same hash computation on that file, and then compare the integer values. You could consider the MD5 algorithm for this purpose. – Jack Jan 03 '13 at 20:04
  • @netcoder seems like if it's greater than 4gb, it's not worth even trying to read it. I suppose that was what I was trying to find out, and subsequently the purpose of my question. – SSH This Jan 03 '13 at 20:56
  • 1
    Amazing how many times people say "I don't understand your question" when it's a C question. Perhaps try reading the question in its entirety before trying to dismiss it? Apprehend your own fallibility and help others. http://translate.google.com/?tl=fr – SSH This Jan 03 '13 at 21:21
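
The chunked comparison suggested in the first comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `file_matches_array` and `CHUNK_SIZE` are hypothetical names that do not appear in the thread, and the 4096-byte chunk is simply the size mentioned in the comment. Only one chunk is held in memory at a time, so the size of the file no longer matters for the comparison itself.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* chunk size suggested in the comment above */

/* Hypothetical helper: returns 1 if the file at `path` holds exactly the
 * `len` bytes in `expected`, 0 if it differs, -1 if it cannot be read. */
int file_matches_array(const char *path, const char *expected, size_t len)
{
    unsigned char chunk[CHUNK_SIZE];
    size_t offset = 0;   /* bytes compared so far, never exceeds len */
    size_t n;
    int result = 1;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        /* Either the file is longer than the array, or this chunk differs. */
        if (n > len - offset || memcmp(chunk, expected + offset, n) != 0) {
            result = 0;
            break;
        }
        offset += n;
    }

    if (ferror(fp))
        result = -1;                 /* read error, not a mismatch */
    else if (result == 1 && offset != len)
        result = 0;                  /* the file is shorter than the array */

    fclose(fp);
    return result;
}
```

Logging could then be a second pass over the file with the same loop, writing each chunk to the log instead of comparing it.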

2 Answers

2

You may use fstat(3) from sys/stat.h. Here is a small function that gets the size of the file, allocates memory if the file is smaller than 4 GB, and fails otherwise. It reads the whole file into a char array and returns a char * to it. The returned buffer should be free'd after use.

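#include <stdio.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <fcntl.h>

// Returns a malloc'ed, NUL-terminated copy of the file, or NULL on error.
char *loadlfile(const char *path)
{
    int file_descr;
    FILE *fp;
    struct stat buf;
    char *buffer;
    size_t n;

    if((file_descr = open(path, O_RDONLY)) < 0)
        return NULL;

    if(fstat(file_descr, &buf) != 0) {
        close(file_descr);
        return NULL;
    }

// This check is done at preprocessing time. It basically means "if this
// machine is not of a popular 64-bit architecture, it possibly has limits
// on maximum memory size". It is there to skip an unnecessary malloc(3)
// call at runtime. The tests are joined with &&: with ||, the condition is
// always true, because no compiler defines all of these macros at once.

//   AMD64 / Intel 64 (GCC, MSVC)                    ARM64                     Itanium (GCC, MSVC)
#if !defined(__x86_64__) && !defined(_M_X64) && !defined(__aarch64__) && !defined(__ia64__) && !defined(_M_IA64)
#define FILE_MAX_BYTES (4000000000)
    // buf.st_size is an off_t, you may need to cast it.
    if(buf.st_size >= FILE_MAX_BYTES-1) {
        close(file_descr);
        return NULL;    // the function returns a pointer, so errors are NULL, not -1
    }
#endif

    if(NULL == (buffer = malloc(buf.st_size + 1))) {
        close(file_descr);
        return NULL;
    }

    fp = fdopen(file_descr, "rb");
    if(NULL == fp) {
        free(buffer);
        close(file_descr);
        return NULL;
    }

    // fread returns the number of bytes read, so the terminator goes right
    // after them. (Storing fgetc's result in a char loses the EOF value.)
    n = fread(buffer, 1, buf.st_size, fp);
    buffer[n] = '\0';

    fclose(fp);     // also closes file_descr, so no separate close() here
    return buffer;
}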

A very broad list of pre-defined macros for various things can be found at http://sourceforge.net/p/predef/wiki/Home/. The reason for the architecture and file size check is that malloc can be expensive at times, and it is best to skip its usage when it is not needed. Asking a machine that can address at most 4 GB of memory for a single block of nearly 4 GB is just a waste of those precious cycles.
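
As an alternative to detecting the architecture with predefined macros, the size reported by `fstat` can be compared against `SIZE_MAX` at run time. The sketch below assumes a POSIX system (it relies on `fileno` and `fstat`); `file_size_for_malloc` is a hypothetical name used only for illustration. It answers the narrower question of whether the size can be passed to `malloc` at all; `malloc` may of course still return NULL.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: stores in *out the number of bytes needed to hold the
 * whole file behind `fp` plus a terminating '\0'. Returns 0 on success and
 * -1 if fstat fails or the size does not fit in a size_t. */
int file_size_for_malloc(FILE *fp, size_t *out)
{
    struct stat st;

    if (fstat(fileno(fp), &st) != 0 || st.st_size < 0)
        return -1;

    /* off_t may be wider than size_t (for example on 32-bit builds with
     * large-file support), so compare in the widest unsigned type. */
    if ((uintmax_t)st.st_size >= (uintmax_t)SIZE_MAX)
        return -1;

    *out = (size_t)st.st_size + 1;
    return 0;
}
```

A caller would check the return value, pass `*out` to `malloc`, and still test the result of `malloc` for NULL.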

  • Thank you for your code. I am curious about the `#define FILE_MAX_BYTES (4000000000)` line. Is this used as a precaution so that it doesn't surpass 4gb in memory allocation? If I increased this number would it cause problems in some platforms? Thx again. – SSH This Jan 03 '13 at 21:10
  • 1
    Yes, it is a precaution. But that and the `if(buf.st_size >= FILE_MAX_BYTES-1)` check are not really necessary, as `malloc(3)` would return a *null pointer* if enough memory *could not* be allocated. I *don't think* this kind of file size limit exists in most popular C implementations, as these limits are _completely_ platform dependent. Some filesystems won't allow files over 4 GB, some machines (<= 32-bit) have restrictions on maximum memory size, and some will _simply_ have less than 4 GB of memory. I'll modify the code so that the check and the limit are removed if the machine is > 32-bit. –  Jan 03 '13 at 21:39
  • 1
    You're welcome @SSHThis. And I have to thank you too, as this question led me to find http://sourceforge.net/p/predef/wiki/Home/. It has great stuff that anybody might need. I'd suggest you check it out. Lots of valuable information there. –  Jan 03 '13 at 22:05
  • Should fp be closed using `fclose` before the function returns? Does it matter? – SSH This Jan 03 '13 at 23:10
  • No worries, for some reason I am seeing this warning: `warning: assignment makes pointer from integer without a cast` on this line: `fp = fdopen(file_descr, "rb");` – SSH This Jan 03 '13 at 23:12
  • I haven't run this code. I'll try it now. By the way, which compiler do you use? –  Jan 03 '13 at 23:14
  • I am using GCC on a Linux platform – SSH This Jan 03 '13 at 23:16
  • Ah i believe it is because fdopen isn't a standard function. But I believe I have enough to go on here, thanks for your help Mr Kayaalp – SSH This Jan 03 '13 at 23:33
  • 1
    BTW, this segfaults. When buffer is declared inside, its ok. So I'll modify it to create the buffer internally and return a pointer to it. Sorry for untested code. –  Jan 03 '13 at 23:52
  • No worries! Your example code helped me loads, cheers :) Happy 2013 to you! – SSH This Jan 04 '13 at 04:00
1

If I understand your question correctly, just adapt the code from the linked answer like this:

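    char * buffer = 0;
    long length;
    FILE * f = fopen (filename, "rb");

    if (f)
    {
        fseek (f, 0, SEEK_END);
        length = ftell (f);
        // ftell returns -1 on failure; MY_MAX_SIZE is the limit you choose.
        if (length < 0 || length > MY_MAX_SIZE)
        {
            fclose (f);   // do not leak the stream on the error path
            return -1;
        }

        fseek (f, 0, SEEK_SET);
        buffer = malloc (length);
        if (buffer)
        {
            fread (buffer, 1, length, f);
        }
        fclose (f);
    }

    if (buffer)
    {
      // start to process your data / extract strings here...
      // (the buffer holds raw bytes and is not NUL-terminated)
    }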
  • Thanks for your response, couple of questions, isn't "rb" read binary? Also, `fseek` can possibly fail? – SSH This Jan 03 '13 at 20:00
  • 1) Yes, it means `read-binary`. 2) I'm not sure if `fseek()` may fail, but if you can return-value from to off_t? (I'm assuming a POSIX environment, but I believe that there is a Windows equivalent) – Jack Jan 03 '13 at 20:07
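
On whether `fseek` can fail: both `fseek` and `ftell` report failure through their return values, so the snippet above can check them. A minimal sketch, where `stream_length` is a hypothetical name that does not come from either answer:

```c
#include <stdio.h>

/* Hypothetical helper: returns the length of the stream in bytes, or -1 if
 * fseek or ftell fails. On success the stream is left at the start. */
long stream_length(FILE *f)
{
    long length;

    if (fseek(f, 0, SEEK_END) != 0)   /* fseek returns nonzero on failure */
        return -1;

    length = ftell(f);                /* ftell returns -1L on failure */
    if (length < 0)
        return -1;

    if (fseek(f, 0, SEEK_SET) != 0)
        return -1;

    return length;
}
```

Note that `ftell` returns a `long`, so where `long` is 32 bits it cannot represent sizes of 2 GB or more. POSIX provides `fseeko` and `ftello`, which use `off_t`, and Microsoft's C runtime provides `_fseeki64` and `_ftelli64`.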