14

I've read posts that show how to use fseek and ftell to determine the size of a file.

FILE *fp;
long file_size;
char *buffer;

fp = fopen("foo.bin", "r");
if (NULL == fp) {
 /* Handle Error */
}

if (fseek(fp, 0 , SEEK_END) != 0) {
  /* Handle Error */
}

file_size = ftell(fp);
buffer = (char*)malloc(file_size);
if (NULL == buffer){
  /* handle error */
}

I was about to use this technique but then I ran into this link that describes a potential vulnerability.

The link recommends using fstat instead. Can anyone comment on this?

Frank
  • 18,432
  • 9
  • 30
  • 30

5 Answers5

17

The link is one of the many nonsensical pieces of C coding advice from CERT. Their justification is based on liberties the C standard allows an implementation to take, but which are not allowed by POSIX and thus irrelevant in all cases where you have fstat as an alternative.

POSIX requires:

  1. that the "b" modifier for fopen have no effect, i.e. that text and binary mode behave identically. This means their concern about invoking UB on text files is nonsense.

  2. that files have a byte-resolution size set by write operations and truncate operations. This means their concern about random numbers of null bytes at the end of the file is nonsense.

Sadly with all the nonsense like this they publish, it's hard to know which CERT publications to take seriously. Which is a shame, because lots of them are serious.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • 1
    I agree with this w.r.t. the idea that the advice against using fseek to figure out the size of a file improves security. I still would use fstat if I wanted to know the size of a file, but because I think it's clean and clear, not because I think it will avoid security problems by itself. – John Zwinck May 11 '11 at 00:34
  • 2
    Using `fseek` has the advantage that it will work on block device files – bdonlan May 11 '11 at 01:35
  • 1
    I would avoid `fstat` because it precludes operating on device nodes where your application might otherwise work fine. – R.. GitHub STOP HELPING ICE May 11 '11 at 01:37
  • Why would they want text and binary to be read the same way?! that's stupid as hell! – MarcusJ Dec 12 '15 at 16:30
  • 3
    @MarcusJ: I'm not sure what you're asking or what your comment is supposed to contribute to this question or this answer. – R.. GitHub STOP HELPING ICE Dec 12 '15 at 21:49
  • 1
    @MarcusJ Basically POSIX is saying that text and binary files are treated the same so an end of line character on POSIX systems will always be one character. But on Windows it isn't one character and hence the C standard needs both b and r file modes. – Jerry Jeremiah Aug 23 '20 at 22:11
8

If your goal is to find the size of a file, definitely you should use fstat() or its friends. It's a much more direct and expressive method--you are literally asking the system to tell you the file's statistics, rather than the more roundabout fseek/ftell method.

A bonus tip: if you only want to know if the file is available, use access() rather than opening the file or even stat'ing it. This is an even simpler operation which many programmers aren't aware of.

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • 3
    This relies on POSIX functionality (`fstat`) rather than plain C, and it's less reliable. For instance, if the "file" is `/dev/sda1`, `fstat` will show 0 size, but the `fseek`/`ftell` method will get the size of the partition. (Actually you should use `fseeko`/`ftello` which use `off_t`, since `long` is likely too short to store the size of a partition...) – R.. GitHub STOP HELPING ICE May 11 '11 at 00:26
  • What if the goal is to determine the size of the file in order to allocate a buffer big enough. Does it make a difference? – Frank May 11 '11 at 00:27
  • @Frank: If you need to read the entire file into a buffer at once, you should think hard about why that is. I can think of virtually no cases where it wouldn't be better to read the file in chunks, which anyway is less likely to fail on large files. – John Zwinck May 11 '11 at 00:29
  • 1
    There are plenty of cases where you need to read the whole file. `sort` is the first that comes to mind. – R.. GitHub STOP HELPING ICE May 11 '11 at 00:30
  • Do you think `sort` allocates one humongous buffer the size of the input file? I don't know for sure, but I would be a little surprised. – John Zwinck May 11 '11 at 00:35
  • @John: well, it won't allocate one big buffer if you call it with `-m`, but otherwise I'd expect it to do it. – ninjalj May 11 '11 at 00:42
  • 2
    I just tried it. It will allocate a buffer up to the size of the longest line, but after that, it uses a temporary file to store lines while sorting. It does not seem to allocate one giant region for the entire input file. – John Zwinck May 11 '11 at 01:04
  • In any case you need to read the whole file and store the contents somewhere to implement `sort`. Whether you use memory or a temp file is a small detail for handling extremely large files. There are plenty of cases like this, in any case, where the max file size would be orders of magnitude smaller than virtual address space and thus where loading the whole file may be the simplest approach. – R.. GitHub STOP HELPING ICE May 11 '11 at 01:37
5

I'd tend to agree with their basic conclusion that you generally shouldn't use the fseek/ftell code directly in the mainstream of your code -- but you probably shouldn't use fstat either. If you want the size of a file, most of your code should use something with a clear, direct name like filesize.

Now, it probably is better to implement that using fstat where available, and (for example) FindFirstFile on Windows (the most obvious platform where fstat usually won't be available).

The other side of the story is that many (most?) of the limitations on fseek with respect to binary files actually originated with CP/M, which didn't explicitly store the size of a file anywhere. The end of a text file was signaled by a control-Z. For a binary file, however, all you really knew was what sectors were used to store the file. In the last sector, you had some amount of unused data that was often (but not always) zero-filled. Unfortunately, there might be zeros that were significant, and/or non-zero values that weren't significant.

If the entire C standard had been written just before being approved (e.g., if it had been started in 1988 and finished in 1989) they'd probably have ignored CP/M completely. For better or worse, however, they started work on the C standard in something like 1982 or so, when CP/M was still in wide enough use that it couldn't be ignored. By the time CP/M was gone, many of the decisions had already been made and I doubt anybody wanted to revisit them.

For most people today, however, there's just no point -- most code won't port to CP/M without massive work; this is one of the relatively minor problems to deal with. Making a modern program run in only 48K (or so) of memory for both the code and data is a much more serious problem (having a maximum of a megabyte or so for mass storage would be another serious problem).

CERT does have one good point though: you probably should not (as is often done) find the size of a file, allocate that much space, and then assume the contents of the file will fit there. Even though the fseek/ftell will give you the correct size with modern systems, that data could be stale by the time you actually read the data, so you could overrun your buffer anyway.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • 3
    Regarding your last paragraph, there's no overrun danger as long as you use the same length you computed for allocating the buffer and reading into it. And there's no way to read a file atomically, so the possibility of the length being stale, while real, is inevitable. – R.. GitHub STOP HELPING ICE May 11 '11 at 01:46
  • @R: Right -- the problem arises if you compute a length, and then just read the whole file, assuming it'll fit. To avoid the length being stale, you just read to the end of file, and expand the buffer if necessary (or use something like mmap...). – Jerry Coffin May 11 '11 at 02:08
  • Your workaround only works if the only change being made to the file is appending. And `mmap` is even more dangerous. If the file size goes down after `mmap`, you'll `SIGBUS` on read and you cannot safely recover. – R.. GitHub STOP HELPING ICE May 11 '11 at 02:57
  • @R: yes, if truncation is a possibility, you have to take rather different steps. I'd dispute SIGBUS being more dangerous -- it certainly shuts down the program, but doesn't threaten the rest of the system (at worst enables a DoS attack, which is minor compared to many buffer overrun-based attacks). – Jerry Coffin May 11 '11 at 03:18
  • I can't imagine how you would make a buffer overrun here. `p=malloc(len); if (!p) goto error; fread(p,len,1,f);` – R.. GitHub STOP HELPING ICE May 11 '11 at 04:06
  • 5
    @R: You obviously lack imagination! `p=malloc(len); while ((ch=getc())!=EOF) *p++=ch;` Of course, I have a little advantage: I don't have to imagine -- I've fixed code at least that bad. – Jerry Coffin May 11 '11 at 04:26
  • Windows has `GetFileSizeEx`, no need for the `FindFirstFile` nonsense. In general, I am amazed by how limited Linux / POSIX API is compared to WinAPI. For example, you can't get the size of a file in one syscall and without any extra junk that `fstat` returns. – Violet Giraffe Feb 10 '23 at 20:45
  • 1
    @VioletGiraffe: Depends a bit on the situation. In some cases you don't even want to try to read the file if it's too large, but `GetFileSizeEx` doesn't support that--it requires that you open the file before you can get the size (at which point you've already changed the file's last access time, even though you end up not reading the file). So if you're recommending one rather blindly, without knowing all the details of how it's being used, `FindFirstFile` is the less intrusive (and therefore more widely applicable). – Jerry Coffin Feb 11 '23 at 03:42
4

The reason to not use fstat is that fstat is POSIX, but fopen, ftell and fseek are part of the C Standard.

There may be a system that implements the C Standard but not POSIX. On such a system fstat would not work at all.

Zan Lynx
  • 53,022
  • 10
  • 79
  • 131
  • 5
    The point of the CERT advisory is that on such a system, the seek/tell approach might have implementation-defined or undefined behavior. However on such a system you don't have `fstat`. Any system that does have `fstat` certainly has sane, well-defined idea of file offsets and lengths. – R.. GitHub STOP HELPING ICE May 11 '11 at 01:44
  • C is used on some systems where there isn't even an operating system. So you can't really assume POSIX unless you know who is going to run your code. Even Windows isn't POSIX although it does have a fstat function (but the function is called _fstat because Microsoft wants you to know it isn't a standard function.) – Jerry Jeremiah Aug 23 '20 at 22:15
2

According to C standard, §7.21.3:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.

A letter-of-the-law kind of guy might think this UB can be avoided by calculating file size with:

fseek(file, -1, SEEK_END);
size = ftell(file) + 1;

But the C standard also says this:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

As a result, there is nothing we can do to fix this with regard to fseek / SEEK_END. Still, I would prefer fseek / ftell instead of OS-specific API calls.

Medinoc
  • 6,577
  • 20
  • 42
Agnius Vasiliauskas
  • 10,935
  • 5
  • 50
  • 70