
I'm doing a project for my school and I can't find out how to get the size of a file. Since I need to read a script and use it in my program, I need the size of the file to use either read or fread.

Here is what I've done to get the file size but it doesn't seem to work.

int my_size(int filedesc)
{
    int size = 1;
    int read_output = 1;
    char *buffer;

    for (size = 1; read_output != 0 ; size++) {
        buffer = malloc((size+1)*sizeof(char*));
        read_output = read(filedesc, buffer, size);
        free(buffer);
    }
    return(size);
}

The rules for this project don't allow me to use stat() or fseek(), nor can I use read or fread with an arbitrary fixed size like 100, since the scripts given can be either small or big.

Useless
  • If you can't use `fseek`, you will need to implement your own `fseek`, like in [this exercise](http://www.learntosolveit.com/cprogramming/Ex_8.4.html) – Leonardo Alves Machado Jan 02 '18 at 15:38
  • You could do `size_t count = 0; while (getchar() != EOF) count++;`, but why can't you use `fseek()` plus `ftell()` or `stat()` (and presumably `fstat()` isn't allowed either)? What's the motivation behind rejecting those? Could your input be coming from a terminal or pipe or other non-seekable (non-repeatable) device? Do you need to save a copy of the file contents? – Jonathan Leffler Jan 02 '18 at 15:42
  • @Useless : Doing repeatedly _read()_ as you do above, you won't get the correct size, as _read()_ function will read from the offset set by previous read operation. – סטנלי גרונן Jan 02 '18 at 15:46
  • Your code is peculiar. You start off by excluding the possibility that the file is empty. You allocate space for 2 pointers (`sizeof(char*)` rather than `sizeof(char)`); you read one byte; you release the buffer; you increment `size` by `1` to 2; you allocate space for 3; you read 2 bytes; you free, and increment `size` by `1` to 3; you allocate space for 4; you read 3 bytes; you free, and increment `size` to 4 (but you've now read 6 bytes); and you continue. – Jonathan Leffler Jan 02 '18 at 15:47
  • That is a terrible way to find a file size, it is very inefficient. You can avoid `stat` by using `fstat`, since the file is already open. – cdarke Jan 02 '18 at 15:49
  • @JonathanLeffler I can't use them since these functions are banned in this project (my school allows very few functions, thus forcing us to "think outside the box"), and I hadn't thought of the possibility that the file could be empty. And I see now, thanks to you, that my malloc is flawed. – Useless Jan 02 '18 at 15:55
  • @cdarke Sadly, fstat is a banned function :/ – Useless Jan 02 '18 at 15:56
  • @JonathanLeffler : Yet - just to clarify things - the problem is not that he allocates one extra byte. The issue here is that he is not taking into account that the offset is changing with each read() operation! And as you correctly mentioned, he doesn't count the bytes already read. – סטנלי גרונן Jan 02 '18 at 15:57

2 Answers


If you can rely on the input to be a persistent file (i.e. residing on storage media), and on that file not being modified during your program's run, then you could pre-read it to the end to count the bytes in it, then rewind.

But outside of an academic exercise, the usual reason to forbid measuring the size via stat(), fseek(), and similar is that the input might not reside on storage media, so that

  1. you cannot determine its size without reading it, but also
  2. you cannot rewind it or seek within it.

The trick then is not how to determine the size in advance, but rather how to do without measuring the size in advance. There are at least two main strategies for that:

  • Don't rely on storing the whole contents in memory at once in the first place. Instead, operate on its contents as they are read, maintaining only enough in memory at any given time to do so.

  • Alternatively, adapt dynamically to the file size. There are many variations on this. For example, if you're just reading the file into a monolithic block then you can malloc() space and realloc() when you find you need more. Or you could store the contents in a linked list, allocating new list nodes as needed.
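
As a rough illustration of the first strategy, here is a minimal sketch that processes the input as it arrives instead of storing it. The function name `count_newlines` and the 4096-byte buffer are assumptions made for the example; counting newlines simply stands in for whatever per-chunk work the program really needs to do.

```c
#include <unistd.h>   /* read(), ssize_t */

/* Hypothetical helper: counts '\n' bytes arriving on fd without ever
 * holding more than one fixed-size chunk in memory.
 * Returns the count, or -1 if read() reports an error. */
long count_newlines(int fd)
{
    char buffer[4096];
    long lines = 0;
    ssize_t got;

    /* read() returns 0 at end of input and -1 on error. */
    while ((got = read(fd, buffer, sizeof buffer)) > 0) {
        for (ssize_t i = 0; i < got; i++) {
            if (buffer[i] == '\n')
                lines++;
        }
    }
    return got < 0 ? -1 : lines;
}
```

Because it never seeks or rewinds, a loop of this shape should work equally well on pipes and terminals, and no allocation is needed at all.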

As for the approach presented in the question, there are several issues with it. It appears to be an attempt to do as I first described -- reading the file to the end to determine its size -- but

  1. It seems to assume that each read() will start at the beginning of the file, or perhaps that read() will fail if it cannot read the full file. Neither is the case. Each read() will start at the file's current position, and will leave the file positioned after the last byte transferred.

  2. Because it changes the file position, your approach will require the file to be rewound after -- via lseek(), for example. But if lseek() can be used for that purpose (and note well my previous comments with respect to files in which you cannot seek), then it would provide a much cleaner approach to measuring the file's size.

  3. You do not account for I/O errors. If one occurred then it would probably send your program into an infinite loop.

  4. Dynamic allocation is comparatively expensive, and you're doing a whole lot of it. If you want to implement the pre-reading strategy, then this would be a better implementation:

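    #include <unistd.h>  /* read(), ssize_t */

    /* Count the bytes between fd's current position and end-of-file. */
    ssize_t count_bytes(int fd) {
        ssize_t num_bytes = 0;
        char buffer[2048];
        ssize_t result;

        do {
            result = read(fd, buffer, sizeof(buffer));
            if (result < 0) {
                // handle error ...
                return -1;  // one option: report the failure to the caller
            }
            num_bytes += result;
        } while (result > 0);

        return num_bytes;
    }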
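
For completeness, the whole pre-read strategy (count, rewind, then load) might look roughly like the sketch below. The name `load_seekable` is hypothetical, and the sketch assumes that the descriptor starts at offset 0, that the file is not modified in between, and that lseek() is permitted, which it may not be under the question's rules.

```c
#include <stdlib.h>     /* malloc(), free() */
#include <sys/types.h>  /* off_t */
#include <unistd.h>     /* read(), lseek(), ssize_t */

/* Hypothetical helper: loads a seekable file into a NUL-terminated buffer.
 * Assumes fd is positioned at the start and the file does not change.
 * Returns NULL on any failure; the caller frees the result. */
char *load_seekable(int fd, size_t *out_len)
{
    char chunk[2048];
    size_t total = 0;
    ssize_t got;

    /* Pass 1: read to the end just to count the bytes. */
    while ((got = read(fd, chunk, sizeof chunk)) > 0)
        total += (size_t)got;
    if (got < 0)
        return NULL;

    /* Rewind; this fails on pipes and terminals. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return NULL;

    char *data = malloc(total + 1);
    if (data == NULL)
        return NULL;

    /* Pass 2: read() may return fewer bytes than requested, so loop. */
    size_t filled = 0;
    while (filled < total) {
        got = read(fd, data + filled, total - filled);
        if (got <= 0) {          /* error, or the file shrank */
            free(data);
            return NULL;
        }
        filled += (size_t)got;
    }
    data[total] = '\0';
    *out_len = total;
    return data;
}
```

The file is read twice, which is the price of this approach; on a non-seekable input the lseek() call fails and the function gives up rather than returning partial data.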
    
John Bollinger
  • I see how I could improve my code but `realloc()` is forbidden and I'm not skilled enough with linked lists to use them. – Useless Jan 02 '18 at 16:06
  • As the number of bytes in a file can exceed `SIZE_MAX`, using the widest integer type `uintmax_t` could be useful to accumulate the file size. – chux - Reinstate Monica Jan 02 '18 at 16:20
  • @Useless, in that case, the constraints on the exercise seem to be designed to force you to employ the first strategy I described: "Don't rely on storing the whole contents in memory at once in the first place." I'm especially inclined to think so if "not skilled enough with linked lists" can be broadened to "not *expected to be* skilled enough with linked lists [...]." – John Bollinger Jan 02 '18 at 16:22
  • @JohnBollinger you have pretty much explained everything I needed to get past this problem, and I found an alternative (since what you proposed contradicts my school's coding style). I found a way to use malloc() and adapt dynamically to the file size, but it is still inefficient. – Useless Jan 02 '18 at 16:35

Compile your program with all warnings and debug information (gcc -Wall -Wextra -g with GCC), then run the executable under the gdb debugger or strace(1). Carefully read the documentation of read(2) and of every function you use (including malloc(3), whose failure you forgot to test for).

You need to use the result of read(2), which is the number of bytes actually read. And you need to handle the error case (when read gives -1) specially.

What is probably happening, with a long enough file, is that on the first iteration you read 1 byte, on the second you read 2 bytes, on the third you read 3 bytes, and so on (and you forgot to compute 1+2+3 in that case).
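
To make the arithmetic concrete, the sketch below simulates the question's loop without touching a real file. It assumes the input is a regular file, so that each read() returns the smaller of the requested size and the bytes remaining; `simulate_my_size` is a made-up name for this illustration.

```c
/* Hypothetical model of the question's my_size() loop.
 * Assumption: each read() of `size` bytes returns min(size, remaining),
 * which is the usual behaviour for a regular file. */
long simulate_my_size(long file_size)
{
    long remaining = file_size;
    long read_output = 1;
    long size;

    for (size = 1; read_output != 0; size++) {
        read_output = size < remaining ? size : remaining;
        remaining -= read_output;   /* the file offset advances */
    }
    return size;                    /* what my_size() would report */
}
```

Under that assumption a 10-byte file is consumed in reads of 1, 2, 3 and 4 bytes, a fifth read returns 0, and the function reports 6. An empty file is reported as 2. The returned value grows roughly like the square root of twice the file size, not like the size itself.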

You should accumulate the sum of all the read_output values, and you should handle the case where read(2) returns less than size (this should happen on the last read that returns a non-zero count).

I would instead suggest using a fixed-size buffer and calling read(2) repeatedly, carefully using the returned byte count (and handling errors and the end-of-file condition).

Be aware that system calls (listed in syscalls(2)) are quite expensive. As a rule of thumb, you should read(2) or write(2) a buffer of several kilobytes (and carefully handle the returned byte count, also testing it for errors; see errno(3)). A program that reads only a few bytes per call is inefficient.

Also, malloc (or realloc) is quite expensive. Growing the heap-allocated size by one is ugly, since you call malloc on every iteration (and in your case you don't even need malloc). You would do better to use some geometric progression, perhaps newsize = 4*oldsize/3 + 10; (or similar).
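
A minimal sketch of that geometric growth, for cases where realloc() is allowed (the asker says it is not in this project). The name `read_all` and the 4096-byte starting capacity are assumptions for illustration only.

```c
#include <stdlib.h>   /* malloc(), realloc(), free() */
#include <unistd.h>   /* read(), ssize_t */

/* Hypothetical helper: reads everything from fd into a heap buffer that
 * grows geometrically. Returns NULL on failure; the caller frees the
 * result. The buffer is not NUL-terminated; its length is in *out_len. */
char *read_all(int fd, size_t *out_len)
{
    size_t cap = 4096;   /* arbitrary starting capacity */
    size_t len = 0;
    char *data = malloc(cap);
    if (data == NULL)
        return NULL;

    for (;;) {
        if (len == cap) {
            /* Grow by about a third, as suggested above. */
            size_t newcap = 4 * cap / 3 + 10;
            char *bigger = realloc(data, newcap);
            if (bigger == NULL) {
                free(data);
                return NULL;
            }
            data = bigger;
            cap = newcap;
        }
        ssize_t got = read(fd, data + len, cap - len);
        if (got < 0) {           /* I/O error */
            free(data);
            return NULL;
        }
        if (got == 0)            /* end of input */
            break;
        len += (size_t)got;      /* use the count actually read */
    }
    *out_len = len;
    return data;
}
```

With this progression the number of realloc() calls grows only logarithmically with the input size, and each read() asks for all the free space left in the buffer rather than a single byte.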

Basile Starynkevitch
  • I've reread the manual for read(2) and figured out that I misused read; I shouldn't have used it the way I did in my example. – Useless Jan 02 '18 at 16:08