14

While looking for ways to find the size of a file given a FILE*, I came across this article advising against it. Instead, it seems to encourage using file descriptors and fstat.

However, I was under the impression that fstat, open, and file descriptors in general are not as portable (after a bit of searching, I found something to this effect).

Is there a way to get the size of a file in ANSI C while keeping in line with the warnings in the article?

math4tots
  • Please note that the article you linked to is Considered Harmful. `fseek`/`ftell` (actually `fseeko`/`ftello`, if you have POSIX, so you can deal with large files) is the preferred way to determine file size. The `stat`-based alternative will fail to determine sizes of some non-regular-files that *do have* well-defined sizes, such as block devices (disk partitions, etc.). – R.. GitHub STOP HELPING ICE Mar 23 '12 at 00:24
  • It's not useful but... open a file in append mode works: FILE* fp = fopen("teste.txt", "a"); size_t sz = ftell(fp); – Tiago Vieira Nov 02 '17 at 10:40

7 Answers

15

In standard C, the fseek/ftell dance is pretty much the only game in town. Anything else you'd do depends, at least in some way, on the specific environment your program runs in. Unfortunately, that dance also has its problems, as described in the articles you've linked.

I guess you could always read everything out of the file until EOF and keep track along the way - with fread() for example.
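As a rough sketch of that read-until-EOF approach, the helper below counts bytes with fread(). The name count_bytes_by_reading and its signature are made up for illustration; it assumes the stream was opened in binary mode ("rb") and is positioned at the start, and it leaves the stream at end-of-file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of any library): counts the bytes from the
 * current position to end-of-file by reading them.
 * Returns 0 and stores the count in *out, or returns -1 on a read error.
 * Note: unsigned long may be only 32 bits wide on some platforms. */
int count_bytes_by_reading(FILE *fp, unsigned long *out)
{
    unsigned char buf[4096];
    unsigned long total = 0;
    size_t n;

    assert(fp != NULL && out != NULL);

    /* fread returns 0 at end-of-file or on error; ferror tells them apart. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (unsigned long)n;

    if (ferror(fp))
        return -1;

    *out = total;
    return 0;
}
```

This uses only standard library calls, but it is linear in the file size and consumes the stream, so the caller would need rewind() (or to reopen the file) before reading the data again.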

Carl Norum
  • I think the answer was downvoted because of the specific wording in the C standard, which, at least should've been mentioned: `Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.` and `A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.` – Alexey Frunze Mar 22 '12 at 22:22
  • Unfortunately it also doesn't provide any other options. – Carl Norum Mar 22 '12 at 23:07
  • Maybe it doesn't have to. You can `fread()` or `fgetc()` until `EOF`, which isn't fast, but should work and be more portable. – Alexey Frunze Mar 22 '12 at 23:11
  • 6
    Note that while ISO C does not define the end of a binary file, POSIX does, and all real-world, post-1980 implementations of C agree on this issue. Binary files have an exact size and you can seek relative to the end. – R.. GitHub STOP HELPING ICE Mar 22 '12 at 23:46
  • 1
    But using POSIX functions is undefined behavior according to C. There is no solution for undefined behavior in solving this problem. `fseek` using `SEEK_END` is undefined behavior, and calling a function that is not in ISO C and not in your program is undefined behavior. Solving this problem, and most other everyday problems, requires removing the ISO C blinders from one's eyes. – Kaz Mar 23 '12 at 19:47
  • @Kaz _"fseek using SEEK_END is undefined behavior"_ - really? I thought that `Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END)` is undefined behavior. So setting the position N bytes `before` SEEK_END (`fseek(file, -1, SEEK_END)`) seems to be OK according to the standard. – Agnius Vasiliauskas Nov 07 '12 at 11:33
  • @0x69 I'd worry about files sized <=1 bytes there. But that looks worth some man page reading – sehe Apr 12 '13 at 07:11
7

The article claims fseek(stream, 0, SEEK_END) is undefined behaviour by citing an out-of-context footnote.

The footnote appears in text dealing with wide-oriented streams, that is, streams on which the first operation performed is a wide-character operation.

This undefined behaviour stems from the combination of two paragraphs. First §7.19.2/5 says that:

— Binary wide-oriented streams have the file-positioning restrictions ascribed to both text and binary streams.

And the restrictions for file-positioning with text streams (§7.19.9.2/4) are:

For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.

This makes fseek(stream, 0, SEEK_END) undefined behaviour for wide-oriented streams. There is no rule like §7.19.2/5 for byte-oriented streams.

Furthermore, when the standard says:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

This doesn't mean it's undefined behaviour to do so; it only means the stream may not support it. If the stream does support it, it's OK.

Apparently this exists to allow binary files to have coarse size granularity, i.e. for the size to be a number of disk sectors rather than a number of bytes, which allows an unspecified number of zeros to magically appear at the end of binary files. SEEK_END cannot be meaningfully supported in this case. Other examples include pipes or infinite files like /dev/zero. However, the C standard provides no way to distinguish between such cases, so you're stuck with system-dependent calls if you want to take that into account.

R. Martinho Fernandes
  • 1
    The last paragraph is not quite right. ISO C allows binary files to have coarse size granularity, i.e. for the size to be a number of disk sectors rather than a number of bytes, and as such allows for an unspecified number of zeros to magically appear at the end of binary files. This is the reason `SEEK_END` may not be "meaningfully" supported. Still, no real-world implementation would be this broken; further, POSIX forbids it. – R.. GitHub STOP HELPING ICE Mar 22 '12 at 23:48
  • @R.. Oh, thanks. That would be indeed quite weird. Would those nulls at the end be read by say `fread`? – R. Martinho Fernandes Mar 23 '12 at 00:06
  • 1
    The article does not cite an out-of-context footnote; it cites a pertinent footnote. The basic claims in the article are based on normative text. The article's author is taking normative text and the notion of undefined behavior out of a rational context, and does not realize that the proposed solution (the use of platform-specific functions, not defined in the C program or the standard library) is also, formally, undefined behavior. – Kaz Mar 23 '12 at 00:07
3

The executive summary is that you must use fseek/ftell because there is no alternative, not even an implementation-specific one, that is better.

The underlying issue is that the "size" of a file in bytes is not always the same as the length of the data in the file and that, in some circumstances, the length of the data is not available.

A POSIX example is what happens when you write data to a device; the operating system only knows the size of the device. Once the data has been written and the (FILE*) closed there is no record of the length of the data written. If the device is opened for read the fseek/ftell approach will either fail or give you the size of the whole device.

When the ANSI C committee was sitting at the end of the 1980s, a number of the operating systems its members remembered simply did not store the length of the data in a file; rather, they stored the disk blocks of the file and assumed that something in the data terminated it. The 'text' stream represents this. Opening a 'binary' stream on those files shows not only the magic terminator byte, but also any bytes beyond it that were never written but happen to be in the same disk block.

Consequently the C-90 standard was written so that it is valid to use the fseek trick; the result is a conformant program, but the result may not be what you expect. The behavior of that program is not 'undefined' in the C-90 definition and it is not 'implementation-defined' (because on UN*X it varies with the file). Neither is it 'invalid'. Rather you get a number you can't completely rely on or, maybe, depending on the parameters to fseek, -1 and an errno.

In practice if the trick succeeds you get a number that includes at least all the data, and this is probably what you want, and if the trick fails it is almost certainly someone else's fault.
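For illustration, a minimal sketch of the trick with the failure paths checked might look like the following. The function name file_size_by_seek is hypothetical, and the sketch assumes a binary stream on a platform where SEEK_END is meaningfully supported; the value returned is whatever the platform reports, with the caveats described above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reports the end-of-file offset using fseek/ftell and
 * then restores the original position. Returns -1L if any step fails
 * (for example on a pipe), in which case errno may describe the failure. */
long file_size_by_seek(FILE *fp)
{
    long start, end;

    assert(fp != NULL);

    start = ftell(fp);                  /* remember where we were */
    if (start == -1L)
        return -1L;

    if (fseek(fp, 0L, SEEK_END) != 0)   /* may fail on non-seekable streams */
        return -1L;

    end = ftell(fp);                    /* -1L here also signals failure */

    if (fseek(fp, start, SEEK_SET) != 0)
        return -1L;

    return end;
}
```

Because ftell returns a long, this cannot report sizes beyond LONG_MAX; on POSIX systems fseeko/ftello with off_t are the usual way around that.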

John Bowler

3

Use fstat. It requires a file descriptor, which you can get from the FILE* with fileno. With that, the size is within your grasp, along with other details.

i.e.

fstat(fileno(filePointer), &buf);

where filePointer is the FILE * and buf is a struct stat:

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
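
Put together, a POSIX-only sketch of this approach could look like the code below. The function name file_size_by_fstat is made up for illustration. It flushes the stream first, so that data still sitting in the stdio buffer is counted, and it returns st_size unchanged, which is only meaningful for regular files.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fileno() and fstat() in strict -std modes */

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper (POSIX only): returns the size recorded by the file
 * system for the file behind fp, or (off_t)-1 on failure. */
off_t file_size_by_fstat(FILE *fp)
{
    struct stat buf;

    assert(fp != NULL);

    /* Push any buffered writes to the descriptor before asking for the size. */
    if (fflush(fp) != 0)
        return (off_t)-1;

    if (fstat(fileno(fp), &buf) != 0)
        return (off_t)-1;

    return buf.st_size;
}
```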
Ed Heal
  • 1
    As a previous poster noted, operating systems differ, but the same sort of thing is available on Windows. The equivalent of `fstat` is available. – Ed Heal Mar 22 '12 at 21:51
  • Guess the best option is to make it work according to the OS. – Ed Heal Mar 22 '12 at 21:51
  • Voted up because it's standard to POSIX. – Guido Mar 22 '12 at 22:36
  • 6
    Danger, Will Robinson! If you use `fstat()` on an open file to which you have previously been writing stuff via a `FILE*` it could well return the wrong size, due to unbuffered data not yet being written. – David Given Mar 22 '12 at 23:26
  • I was making the assumption that the person would take this into account, either by doing this at the start (as hinted in the OP) or by using a flush. – Ed Heal Mar 22 '12 at 23:59
  • @DavidGiven While you point out a common pitfall, obviously `stat` wouldn't be reporting "the wrong size" there - it is actually the size of the file (since the buffered changes have... not yet been written) – sehe Apr 12 '13 at 07:14
2

Different operating systems provide different APIs for this. For example, on Windows we have:

GetFileAttributesEx()

On macOS we have:

[[[NSFileManager defaultManager] attributesOfItemAtPath:someFilePath error:nil] fileSize];

But the only raw (standard C) methods are fread and fseek/ftell; see: How can I get a file's size in C?

user739711
2

You can't always avoid writing platform-specific code, especially when you have to deal with things that are a function of the platform. File sizes are a function of the file system, so as a rule I'd use the native filesystem API to get that information over the fseek/ftell dance. I'd create my own generic wrapper around it, so as to not pollute application logic with platform-specific details and make the code easier to port.
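A minimal sketch of such a wrapper, assuming only two targets (Windows and POSIX), might look like this. The name portable_file_size is hypothetical; the Windows branch relies on _fileno and _filelengthi64 from the Microsoft C runtime, and the other branch on POSIX fileno and fstat. The caller is assumed to have flushed any pending writes.

```c
#if !defined(_WIN32)
#define _POSIX_C_SOURCE 200809L  /* expose fileno() and fstat() in strict -std modes */
#endif

#include <assert.h>
#include <stdio.h>

#if defined(_WIN32)
#include <io.h>          /* _filelengthi64 */
#else
#include <sys/types.h>
#include <sys/stat.h>    /* fstat, struct stat */
#endif

/* Hypothetical wrapper: application code calls this and never sees the
 * platform-specific API. Returns the size in bytes, or -1 on failure.
 * Buffered writes are not counted unless the caller calls fflush(fp) first. */
long long portable_file_size(FILE *fp)
{
    assert(fp != NULL);

#if defined(_WIN32)
    /* Microsoft CRT: length of the file behind the descriptor, -1 on error. */
    return (long long)_filelengthi64(_fileno(fp));
#else
    {
        struct stat st;

        if (fstat(fileno(fp), &st) != 0)
            return -1;
        return (long long)st.st_size;
    }
#endif
}
```

Porting to another platform then means adding one more branch inside the wrapper, while the callers stay unchanged.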

John Bode
-2

The article has a little problem of logic.

It (correctly) identifies that a certain usage of C functions has behavior which is not defined by ISO C. But then, to avoid this undefined behavior, the article proposes a solution: replace that usage with platform-specific functions. Unfortunately, the use of platform-specific functions is also undefined according to ISO C. Therefore, the advice does not solve the problem of undefined behavior.

The quote in my copy of the 1999 standard confirms that the alleged behavior is indeed undefined:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END. [ISO 9899:1999 7.19.9.2 paragraph 3]

But undefined behavior does not mean "bad behavior"; it is simply behavior for which the ISO C standard gives no definition. Not all undefined behaviors are the same.

Some undefined behaviors are areas in the language where meaningful extensions can be provided. The platform fills the gap by defining a behavior.

Providing a working fseek which can seek from SEEK_END is an example of an extension in place of undefined behavior. It is possible to confirm whether or not a given platform supports fseek from SEEK_END, and if this is provisioned, then it is fine to use it.

Providing a separate function like lseek is also an extension in place of undefined behavior (the undefined behavior of calling a function which is not in ISO C and not defined in the C program). It is fine to use that, if available.

Note that those platforms which have functions like the POSIX lseek will also likely have an ISO C fseek which works from SEEK_END. Also note that on platforms where fseek on a binary file cannot seek from SEEK_END, the likely reason is that this is impossible to do (no API can be provided to do it and that is why the C library function fseek is not able to support it).

So, if fseek does provide the desired behavior on the given platform, then nothing has to be done to the program; it is a waste of effort to change it to use that platform's special function. On the other hand, if fseek does not provide the behavior, then likely nothing does, anyway.

Note that even including a nonstandard header which is not in the program is undefined behavior. (By omission of the definition of behavior.) For instance if the following appears in a C program:

#include <unistd.h>

the behavior is not defined after that. [See References below.] The behavior of the preprocessing directive #include is defined, of course. But this creates two possibilities: either the header <unistd.h> does not exist, in which case a diagnostic is required. Or the header does exist. But in that case, the contents are not known (as far as ISO C is concerned; no such header is documented for the Library). In this case, the include directive brings in an unknown chunk of code, incorporating it into the translation unit. It is impossible to define the behavior of an unknown chunk of code.

#include <platform-specific-header.h> is one of the escape hatches in the language for doing anything whatsoever on a given platform.

In point form:

  1. Undefined behavior is not inherently "bad" and not inherently a security flaw (though of course it can be! E.g. buffer overruns linked to the undefined behaviors in the area of pointer arithmetic and dereferencing.)
  2. Replacing one undefined behavior with another, only for the purpose of avoiding undefined behavior, is pointless.
  3. Undefined behavior is just a special term used in ISO C to denote things that are outside of the scope of ISO C's definition. It does not mean "not defined by anyone in the world" and doesn't imply something is defective.
  4. Relying on some undefined behaviors is necessary for making most real-world, useful programs, because many extensions are provided through undefined behavior, including platform-specific headers and functions.
  5. Undefined behavior can be supplanted by definitions of behavior from outside of ISO C. For instance the POSIX.1 (IEEE 1003.1) series of standards defines the behavior of including <unistd.h>. An undefined ISO C program can be a well defined POSIX C program.
  6. Some problems cannot be solved in C without relying on some kind of undefined behavior. An example of this is a program that wants to seek so many bytes backwards from the end of a file.

References:

Kaz
  • 2
    Oh God, not again… ***It's not undefined behavior.*** –  May 02 '12 at 20:45
  • 1
    I think you mix "undefined behavior" and "implementation defined behavior". – Etienne de Martel May 02 '12 at 20:50
  • @Etienne de Martel, [for the second time](http://stackoverflow.com/a/9831307/142019). –  May 02 '12 at 20:53
  • 3
    Really, I think the mixup is about what 'undefined behaviour' _applies to_: the compiler's behaviour is very well defined for processing includes. The resulting program obviously can have undefined behaviour (hell, it could even be ill-formed). Usually 'undefined behaviour' refers to the compiler's actions/output, not the behaviour of the resulting program (although that, of course, becomes hard to reason about at the very same time) – sehe May 02 '12 at 20:56
  • No, "undefined behavior" simply means any situation for which the programming language standard either says that it has "undefined behavior", or for which it provides no definition of behavior. It does not mean "not defined by any system or vendor". It means *not standard-defined*. A compiler's behavior is not very well *standard-defined* at all! The C standard only partially defines what happens when `#include ` is processed. Not enough to actually define the consequences. – Kaz Jan 30 '13 at 17:25
  • "Undefined behavior is behavior, such as might arise upon use of an erroneous program construct or erroneous data, for which the C++ Standard imposes no requirements. Undefined behavior may also be expected when the C++ Standard omits the description of any explicit definition of behavior or defines the behavior to be ill-formed, with no diagnostic required." Although ambiguous to some degree, undefined behavior *is* bad in the sense that you can't know what will happen. Not knowing what your program will do is bad isn't it? – shawn1874 May 05 '17 at 22:05
  • I agree with Etienne's comment. Undefined and implementation defined are very different things. Undefined behavior is typically tied to an ill formed program that is wrong, and the language simply imposes no requirements on how to handle that situation. To say that undefined behavior is not always bad is wrong. It doesn't always result in a noticeable problem, but the fact that we can't know what the result might be is automatically bad. – shawn1874 May 05 '17 at 22:08
  • @shawn1874 You're severely mistaken. "Undefined behavior" in the context of ISO C++ (what we're discussing here) means "not defined by ISO C++" not "not defined by nobody at all". Compilers provide useful, documented extensions which fall under ISO C++ undefined behavior, and which programmers use to their advantage. – Kaz May 06 '17 at 06:16