2

I want to test whether the file position of a given file descriptor fd is at the end of the file, i.e. current position == file size. Is there a way to do this in fewer than 3 system calls? The 3 calls being:

  1. Get current position with lseek
  2. lseek to end of file and store that position (i. e. the file size)
  3. Compare the two, and if they're different, lseek back to the original position.
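
For reference, the three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming fd refers to a seekable regular file; the helper name `at_eof_lseek` is hypothetical and not part of any library.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd's offset equals the file size,
 * 0 if it does not, and -1 on error (for example, fd is not seekable). */
int at_eof_lseek(int fd)
{
    /* 1. Get the current position without moving it. */
    off_t cur = lseek(fd, 0, SEEK_CUR);
    if (cur == (off_t)-1) {
        perror("lseek(SEEK_CUR)");
        return -1;
    }

    /* 2. Seek to the end; the resulting offset is the file size. */
    off_t end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1) {
        perror("lseek(SEEK_END)");
        return -1;
    }

    /* 3. If the two differ, restore the original position. */
    if (cur != end) {
        if (lseek(fd, cur, SEEK_SET) == (off_t)-1) {
            perror("lseek(SEEK_SET)");
            return -1;
        }
        return 0;
    }
    return 1;
}
```

Note that between steps 2 and 3 the offset of fd is temporarily changed, which matters if the descriptor is shared with another thread or process.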
Violet Giraffe
  • 2
    If this is something you are writing in `c`, best to add that as a tag. Otherwise us shell jockeys are going to suggest rewriting in `awk` or something horrible like that. – JNevill Feb 10 '23 at 20:20
  • @JNevill Good point, thanks :) I considered that, but hesitated, since the question is not about the C language or the standard library. Thought the `file-descriptor` tag is enough. Does `awk` accept file descriptors?.. – Violet Giraffe Feb 10 '23 at 20:21
  • 5
    This is essentially impossible with any number of system calls because, whether a test tells you the position is or is not at the end of file at one moment, another process could truncate or extend the file at the next moment. No result will be reliable the moment after it is obtained. – Eric Postpischil Feb 10 '23 at 20:29
  • 1
    @EricPostpischil: that is true for many file operations, and is of no concern to me or to 99.9% applications that work with files. – Violet Giraffe Feb 10 '23 at 20:32
  • 6
    @VioletGiraffe When you come to a public forum and ask people to help you, one of the risks you run is that they'll give you advice other than what you thought you were looking for. And I have to disagree with you: avoiding race conditions — which is what we're talking about here — is or ought to be of concern to *all* programmers! Race conditions are a real problem, and they're worth learning about and keeping in mind, and learning how to avoid. Often the better way of doing something, to avoid the race condition, ends up being easier anyway, and I suspect that's the case here. – Steve Summit Feb 10 '23 at 20:38
  • @SteveSummit: you're right, thank you and Eric for the helpful comments. – Violet Giraffe Feb 10 '23 at 20:40
  • @VioletGiraffe *that is true for many file operations, and is of no concern to me or to 99.9% applications that work with files.* So you only write code for the easy stuff? OK. – Andrew Henle Feb 10 '23 at 20:42
  • @AndrewHenle: have you ever run into a problem caused by a data race on a file? I haven't. If someone else is messing with your files - that's their problem, not yours, feel free to abort or produce incorrect results. If you are messing with someone else's files - shame on you. – Violet Giraffe Feb 10 '23 at 20:49
  • @VioletGiraffe Back to your main point: as a general rule, seeking is a terrible way to determine the size of a file. Using `stat` or `fstat` is vastly preferable. (But, no, switching from `lseek` to `fstat` doesn't improve the race condition.) – Steve Summit Feb 10 '23 at 20:53
  • @SteveSummit: is it terrible because it's slower, or for some other reason? What about the case when the file has just been read from or written to, e. g. I know it's not "cold"? – Violet Giraffe Feb 10 '23 at 20:55
  • 3
    What are you going to do differently based on whether you are/aren't at end-of-file? Sometimes there's a way to get what you want without making an explicit (and race-condition-prone) test, and that's the ideal solution to, er, seek for. – Steve Summit Feb 10 '23 at 20:55
  • 2
    IMO, seeking is a terrible idea because it, in effect, mucks with a global variable. Everybody knows that global variables are bad, but they're everywhere, and the thing you want to be careful of -- the thing that makes a global variable a problem -- is making a "temporary" change to one. Any time you temporarily change a global variable, do some operation that relies on your temporary change, then quick quick change the global variable back to what it was before, you've got a potential problem. So one of my own rules is to always be on the lookout for this pattern, and avoid it if I can. – Steve Summit Feb 10 '23 at 20:59
  • And my point here is that the state of a file descriptor — including its current seek position — acts exactly like a global variable, even though it isn't actually one. – Steve Summit Feb 10 '23 at 21:00
  • @SteveSummit, that is a good point and for many cases you can get away with just `read()`, but I think it's convoluted and produces code that is harder to work with. It's a shame `read()` does not immediately indicate whether there's more to read, or better yet - how many bytes are left. – Violet Giraffe Feb 10 '23 at 21:02
  • 1
    (cont'd from "my point here is") Now, it may be that you know you're the only one mucking with the file descriptor, such that your temporary modification is "perfectly safe". For me (I'm not saying this has to be for you), I hate rationalizations like that. For me, they're the moral equivalent of saying, "It's perfectly safe to hold this gun to my head and pull the trigger, because I know the gun's not loaded." – Steve Summit Feb 10 '23 at 21:02
  • 2
    Well, there's a good reason that `read` does not "immediately indicate whether there's more to read or how many bytes are left", and that's because the designers of Unix were, you may say misguidedly, focused on those measly 0.1% of applications that cared about the possibility of the answer immediately becoming invalid. :-) – Steve Summit Feb 10 '23 at 21:07
  • 2
    @VioletGiraffe, when there is no more to read, `read` *does* convey that information. And that really doesn't matter until you actually attempt that read. I'm inclined to think that it's usually cleaner and clearer to work with that than to separately test for EOF before attempting to read. Even if you're not interested in handling the case where the file size changes between testing for EOF and attempting to read. – John Bollinger Feb 10 '23 at 21:08
  • @SteveSummit: I don't think that argument has merit, logically speaking. If you're reading a file and expecting to stop eventually, then no matter what stop condition you choose, it is always true that it is possible for some other process or thread to have appended to your file just as your EOF condition evaluated to `true`. I think it's just bad API design. On the other hand, you can get the file size beforehand and just read that many bytes. – Violet Giraffe Feb 10 '23 at 21:10
  • 1
    A call to `ioctl` with the request `FIONREAD` would return the number of bytes that are immediately available for reading on a file descriptor. But is that reliable and portable? I think not. But it's often used in networking code. – Harith Feb 10 '23 at 21:15
  • A good discussion in the comments to this question: https://stackoverflow.com/questions/56858983/is-there-a-fast-and-reliable-posix-way-to-check-if-current-file-offset-is-at-the?rq=1 – Violet Giraffe Feb 10 '23 at 21:24
  • 2
    @VioletGiraffe *for many cases you can get away with just `read()`, but I think it's convoluted and produces code that is harder to work with* You might want to post that convoluted code, here or in a new question, and ask if people can see a better way of accomplishing the same task. – Steve Summit Feb 10 '23 at 21:48
  • @VioletGiraffe *have you ever run into a problem caused by a data race on a file? I haven't.* Do things [like this](https://www.redbooks.ibm.com/redpapers/pdfs/redp3945.pdf). I have. In fact, I've actually worked with some of the people listed in that paper. – Andrew Henle Feb 11 '23 at 13:25

2 Answers

4

You can test for end-of-file in just one syscall: a single read! If it returns 0, you're at end-of-file. If it doesn't, you weren't.

...and, of course, if it returns greater than 0, you're not where you were any more, so this might not be a good solution. But if your primary task was reading the file, then the data you've just read with your one read call is quite likely to be data you wanted anyway.

In a comment you said that code that merely calls read can be "convoluted and produce code that is harder to work with", and I kind of know what you mean. I can vaguely remember, once or twice in my career, wishing I could know whether the next read was going to succeed, before I had to do it. But that was just once or twice. The vast, vast majority of the time, for me at least, code that just reads reads reads until one read call returns 0 ends up being perfectly natural and straightforward.


Addendum:

There's some pseudocode from K&R that always sticks with me, for the basic version of grep that they introduce as an example in a fairly early chapter:

while (there's another line) {
    if (line contains pattern) {
        print it;
    }
}

That's for line-based input, but the more-general pattern

while (there's some input)
    process it;

has equal merit, and the fleshing-out to an actual read call doesn't involve that big a change:

while ((n = read(fd, buf, bufsize)) > 0) {
    process n bytes from buf;
}

The embedded read-and-test (that is, the assignment to n and the test against 0, buried in the single control expression of the while loop) used to really bug me; it seemed unnecessarily cryptic. But it really does encapsulate the "while there's input / process it" idiom rather perfectly, at least given a C/Unix-style read call that can only indicate EOF after you call it.

(This is by contrast to Pascal-style I/O, which does indicate EOF before you call it, and is, or used to be, a prime motivator for all the questions that led to "Why is while( !feof(file) ) always wrong?" being a canonical SO question. Brian Kernighan has a description, probably in "Why Pascal Is Not My Favorite Programming Language", of how frustratingly difficult and unnatural it is to implement a Pascal-style input methodology that can explicitly indicate EOF before it happens.)
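
A fleshed-out version of the read loop might look something like the sketch below. The helper name `count_bytes`, the 4096-byte buffer, and the choice of "processing" (just counting bytes) are all assumptions made for illustration; the point is only that the loop condition doubles as the EOF test.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads fd until EOF and returns the number of
 * bytes seen, or -1 if read() reports an error. */
long long count_bytes(int fd)
{
    char buf[4096];
    long long total = 0;
    ssize_t n;

    /* read() returns 0 at end-of-file and -1 on error, so the loop
     * runs only while there is input to process. */
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        total += n;   /* stand-in for "process n bytes from buf" */
    }

    return n < 0 ? -1 : total;
}
```

No separate EOF check is needed: the loop simply ends on the first read that returns 0, and a negative return is distinguished from EOF afterwards.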

Steve Summit
  • Thanks for the great answer with quotes and links! I implemented the EOF check as `current position == file size`, so my version of `while( !feof(file) )` does work correctly (instilled, perhaps, by Qt's `QFile::atEnd()`), but I agree that it's a flawed pattern and your solution is way better. Not necessarily because of TOCTOU issues, but because it's just too much extra work (syscalls and whatnot). It's the worst way to read a file, actually; querying the size beforehand and a `for` loop is much better. – Violet Giraffe Feb 10 '23 at 22:16
  • If you really just want to know whether you're at the end of the file, you can read one character and then use `ungetc` to put it back into the stream. `ungetc` is guaranteed to work once after a read, and it does not invoke any syscall. But honestly, I can't think of any use case. Your answer really nails it. Also: If the file is not local or if it is not a regular file, `read()` might block. And conceivably you would like to know if it's possible to read without blocking. That's a different question, though. – rici Feb 11 '23 at 03:27
3

If you have a file descriptor, you can use fstat() to get the size of the file:

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

struct stat sb;

/* Upon successful completion, 0 shall be returned. 
*  Otherwise, -1 shall be returned and 
*  errno set to indicate the error
*/
if (fstat(fd, &sb) == -1) {
    perror ("fstat()");
    /* Handle error here */
}
off_t size = sb.st_size;

Then call lseek(fd, 0, SEEK_CUR) to get the current position and compare it with the size.
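
Combining fstat() with lseek(), the whole check could be sketched as below. This is a minimal sketch, assuming fd refers to a regular file (st_size is not meaningful for pipes, sockets, or most devices); the helper name `at_eof_fstat` is hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd's offset equals the file size,
 * 0 if it does not, and -1 on error. Uses two system calls and never
 * moves the file offset. */
int at_eof_fstat(int fd)
{
    struct stat sb;

    /* Query the file size without touching the offset. */
    if (fstat(fd, &sb) == -1) {
        perror("fstat()");
        return -1;
    }

    /* lseek() with offset 0 and SEEK_CUR only reports the position. */
    off_t cur = lseek(fd, 0, SEEK_CUR);
    if (cur == (off_t)-1) {
        perror("lseek()");
        return -1;
    }

    return cur == sb.st_size;
}
```

Compared with the three-lseek approach, this saves one call and avoids temporarily changing the offset, but the result can still go stale as soon as it is returned.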


But as noted in the comments:

"This is essentially impossible with any number of system calls because, whether a test tells you the position is or is not at the end of file at one moment, another process could truncate or extend the file at the next moment. No result will be reliable the moment after it is obtained" — Eric Postpischil

Harith
  • Hmm, thank you. Do you think one `stat` is faster than two `lseek`s? – Violet Giraffe Feb 10 '23 at 20:33
  • You should at least spell Eric's name correctly. – Mark Ransom Feb 10 '23 at 20:34
  • 1
    *The call `lseek()` to get the current stream position indicator.* The stream position does not have to be the same as the current position of the underlying file descriptor. – Andrew Henle Feb 10 '23 at 20:40
  • Test it. Anything I say would be a wild guess. – Harith Feb 10 '23 at 20:41
  • "Test it" proves nothing. Just because it happens to be true on the one implementation tested in the one environment tested under the one set of conditions tested is pretty meaningless. – Andrew Henle Feb 10 '23 at 20:43
  • Yes, streams are buffered, your suggestion is incorrect. The correct way for a `FILE*` is `ftell` / `fseek`. There's also `feof`, but I don't know how reliable it is. You should just remove that bit from the answer as it's both incorrect and not related to the question. – Violet Giraffe Feb 10 '23 at 20:50
  • @AndrewHenle then test it on every implementation you intend to port your code to, in different environment, tested under many sets of conditions. :) – Harith Feb 10 '23 at 20:51
  • @AndrewHenle Indeed. Would it be reliable after a call to `fflush()` followed by `fsync()`? (Assuming an output stream) – Harith Feb 10 '23 at 20:53
  • I think `fflush` is enough, no `fsync` required. You just need to tell the stream to flush its buffer to the kernel, so that the kernel then sees all the writes to the file when you issue `lseek`. But if you have a stream, why go around it, why not just `ftell`? – Violet Giraffe Feb 10 '23 at 20:59
  • Yes, why indeed? I got carried away. *"The ftell function obtains the current value of the file position indicator for the stream pointed to by stream.*" – Harith Feb 10 '23 at 21:01