10

I was skimming through K&R C and I noticed that to read the entries in a directory, they used:

while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf)) == sizeof(dirbuf))
    /* code */

Where dirbuf was a system-specific directory structure, and dp->fd a valid file descriptor. On my system, dirbuf would have been a struct linux_dirent. Note that a struct linux_dirent has a flexible array member for the entry name, but let us assume, for the sake of simplicity, that it doesn't. (Dealing with the flexible array member in this scenario would only require a little extra boilerplate code).

Linux, however, doesn't support this construct. When using read() to try reading directory entries as above, read() returns -1 and errno is set to EISDIR.

Instead, Linux dedicates a system call specifically to reading directories, namely the getdents() system call. However, I've noticed that it works in pretty much the same way as above.

while (syscall(SYS_getdents, fd, &dirbuf, sizeof(dirbuf)) > 0)
    /* code */

What was the rationale behind this? There seems to be little/no benefit compared to using read() as done in K&R.

  • 1
    This question may be of interest to you: http://unix.stackexchange.com/questions/154119/when-did-directories-stop-being-readable-as-files – Michael Burr Mar 24 '16 at 12:43

3 Answers

5

getdents will return struct linux_dirent. It will do this for any underlying type of filesystem. The "on disk" format could be completely different, known only to the given filesystem driver, so a simple userspace read call could not work. That is, getdents may convert from the native format to fill the linux_dirent.

couldn't the same thing be said about reading bytes from a file with read()? The on-disk format of the data within a file isn't necessarily uniform across filesystems or even contiguous on disk - thus, reading a series of bytes from disk would again be something I expect to be delegated to the file system driver.

Discontiguous file data is handled by the VFS ["virtual filesystem"] layer, regardless of how an FS chooses to organize the block list for a file. For example, ext4 uses "inodes" ("index" or "information" nodes), which use an "ISAM" ("indexed sequential access method") organization, while an MS/DOS FS can have a completely different organization.

Each FS driver registers a table of VFS function callbacks when it's started. For a given operation (e.g. open/close/read/write/seek), there is a corresponding entry in the table.

The VFS layer (i.e. from the userspace syscall) will "call down" into the FS driver and the FS driver will perform the operation, doing whatever it deems necessary to fulfill the request.

I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented.

Yes. For example, if the read request is to read the first three blocks from the file (e.g. 0,1,2), the FS will look up the indexing information for the file and get a list of physical blocks to read (e.g. 1000000,200,37) from the disk surface. This is all handled transparently in the FS driver.

The userspace program will simply see its buffer filled up with the correct data, without regard to how complex the FS indexing and block fetch had to be.

Perhaps it is [loosely] more proper to refer to this as transferring inode data as there are inodes for files (i.e. an inode has the indexing information to "scatter/gather" the FS blocks for the file). But, the FS driver also uses this internally to read from a directory. That is, each directory has an inode to keep track of the indexing information for that directory.

So, to an FS driver, a directory is much like a flat file that has specially formatted information. These are the directory "entries". This is what getdents returns. This "sits on top of" the inode indexing layer.

Directory entries can be variable length [based on the length of the filename]. So, the on disk format would be (call it "Type A"):

static part|variable length name
static part|variable length name
...

But ... some FSes organize themselves differently (call it "Type B"):

<static1>,<static2>...
<variable1>,<variable2>,...

So, the type A organization might be read atomically by a userspace read(2) call, but the type B organization would have difficulty. So, the getdents VFS call handles this.

couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file?

That is what getdents is for.

Then again, I'm assuming that a FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory rather than a series of bytes.

getdents did not always exist. When dirents were fixed size and there was only one FS format, the readdir(3) call probably did read(2) underneath and got a series of bytes [which is only what read(2) provides]. Actually, IIRC, in the beginning there was only readdir(2) and getdents and readdir(3) did not exist.

But, what do you do if the read(2) is "short" (e.g. two bytes too small)? How do you communicate that to the app?

My question is more like since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading the linux_dirent?

read on a dir isn't intercepted and converted to getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.

You do open(2) for files or dirs [opendir(3) is a wrapper that does open(2) underneath]. You can read/write/seek for files and seek/getdents for dirs.

But ... doing read(2) on a directory returns EISDIR. [Side note: I had forgotten this in my original comments]. In the simple "flat data" model it provides, there isn't a way to convey/control all that getdents can/does.

So, rather than allow an inferior way to get partial/wrong info, it's simpler for the kernel and an app developer to go through the getdents interface.

Further, getdents does things atomically. If you're reading directory entries in a given program, there may be other programs that are creating and deleting files in that directory or renaming them--right in the middle of your getdents sequence.

getdents will present an atomic view. Either a file exists or it doesn't. It's been renamed or it hasn't. So, you don't get a "half modified" view, regardless of how much "turmoil" is happening around you. When you ask getdents for 20 entries, you'll get them [or 10 if there's only that much].

Side note: A useful trick is to "overspecify" the count. That is, tell getdents you want 50,000 entries [you must provide the space]. You'll usually get back something like 100 or so. But, now, what you've got is an atomic snapshot in time for the full directory. I sometimes do this instead of looping with a count of 1--YMMV. You still have to protect against immediate disappearance but at least you can see it (i.e. a subsequent file open fails)

So, you always get "whole" entries and no entry for a just deleted file. That is not to say that the file is still there, merely that it was there at the time of the getdents. Another process may instantly erase it, but not in the middle of the getdents

If read(2) were allowed, you'd have to guess at how much data to read and wouldn't know which entries were fully formed or in a partial state. If the FS had the type B organization above, a single read could not atomically get the static portion and variable portion in a single step.

It would be philosophically incorrect to slow down read(2) to do what getdents does.

getdents, unlink, creat, rmdir, and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about each other".

If pgmA renames "x" to "z" and pgmB renames "y" to "z", they don't collide. One goes first and another second but no FS blocks are ever lost/leaked. getdents gets the whole view (be it "x y", "y z", "x z" or "z"), but it will never see "x y z" simultaneously.

Craig Estey
  • This is all well and good, except that you can't tell `getdents()` how many *entries* you want, only the byte size of your buffer. That's what the `count` argument represents, the size of your buffer *in bytes* and not a count of the number of (variable length) dirents you will get back. You have no idea how many directory entries will fit into your buffer without some a-priori knowledge about the directory contents. So, you're right back to guessing how many entries you want and your theory about being able to specify the number of entries is incorrect (see the code sample in the man page). – Michael Goldshteyn Aug 17 '18 at 23:56
  • This answer is completely incorrect. There are many implementations of `read` in the linux kernel which operate atomically and which only return complete records. inotify being one example. The real answer was posted below by user38527: Originally `read` on directories was implemented incorrectly and `getdents` was added as as a second interface which behaved correctly. So it's all about backwards compatibility and not about `read` being unable to perform the job. – MuhKarma May 20 '20 at 13:31
  • @MuhKarma **NO!** `read(2)` has _always_ returned `EISDIR` if it attempted to read from a directory. The _only_ access was via the _syscall_ `readdir`. This has been true since Linux v1.0. I just checked the v1.0 kernel source to be sure [And, I've been using linux since v0.94 (or earlier)]. `getdents` was added when the `linux_dirent` struct was changed (the older `readdir` syscall still exists and returns the older dirent struct). When `getdents` was added, libc `readdir` was rewritten to use it. – Craig Estey May 21 '20 at 18:29
  • @CraigEstey **NO!** `sys_readdir` was added in v0.95c. `sys_read` from directories worked up to v0.98-pl1. See https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/ – MuhKarma May 22 '20 at 14:03
3

In K&R (actually, Unix up through SVr2 at least, perhaps SVr3), directory entries were 16 bytes, using 2 bytes for the inode and 14 bytes for filenames.

Using read made sense, because the directory entries on the disk were all the same size. 16 bytes (a power of 2) also made sense, because it did not require hardware multiply to compute offsets. (I recall someone telling me around 1978 that the Unix disk driver used floating point and was slow... but that's second hand, although amusing).

Later improvements to directories allowed longer names, which meant that the sizes differed (there being no point to making huge entries all the same as the largest possible name). A newer interface was provided, readdir.

Linux provides a lower-level interface. According to its manual page:

These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.

As illustrated in your example, getdents is a system call, useful for implementing readdir. The manner in which readdir is implemented is unspecified. There is no particular reason why the early readdir (from about 30 years ago) could not have been implemented as a library function using read and malloc and similar functions to manage the long filenames read from the directory.

Moving features into the kernel was done (probably) in this case to improve performance. Because getdents reads several directory entries at a time (unlike readdir), that may reduce the overhead of reading all of the entries for a small directory (by reducing the number of system calls).


Thomas Dickey
  • I feel like this doesn't really answer my question. It seems to me like getdents() is simply reading data from the disk into a buffer, which is generally done with read (also a system call). So it's not that the functionality of read() was moved into the kernel; it was always in the kernel. My question was why getdents() instead of read()? Both are system calls; both perform the same basic functionality (reading data from disk into a buffer); I'm looking for a reason why reading a directory was dedicated its own system call. – Giorgian Borca-Tasciuc Mar 24 '16 at 00:32
  • I pointed out that the likely reason for this was to move a fairly complicated operation into the kernel for performance. `read` by itself needs a lot of help since the directory entries are not fixed-length. – Thomas Dickey Mar 24 '16 at 00:35
  • But correct me if I'm wrong, but isn't read() in the kernel? Also, I don't understand what the implications of a variable-length records have on read. – Giorgian Borca-Tasciuc Mar 24 '16 at 00:51
  • 1
    @GiorgianBorca-Tasciuc `getdents` will return `struct linux_dirent`. It will do this for any underlying type of filesystem. The "on disk" format could be completely different, known _only_ to the given filesystem driver, so a simple userspace `read` call could _not_ work. That is, `getdents` may convert from the native format to fill the `linux_dirent` – Craig Estey Mar 24 '16 at 00:51
  • 1
    agreed: the application would have to know the way directory entries are stored on disk (possibly different for each filesystem). – Thomas Dickey Mar 24 '16 at 00:58
  • @CraigEstey Yes, that makes sense. I'm sorry if I'm beating a dead horse, but...couldn't the same thing be said about reading bytes from a file with read()? The on-disk format of the data within a file isn't necessarily uniform across filesystems or even contiguous on disk - thus, reading a series of bytes from disk would again be something I expect to be delegated to the file system driver. – Giorgian Borca-Tasciuc Mar 24 '16 at 00:59
  • 1
    Regular files are a different case (they don't have structure that the filesystem driver could be expected to know about). – Thomas Dickey Mar 24 '16 at 01:02
  • I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented. – Giorgian Borca-Tasciuc Mar 24 '16 at 01:05
  • @GiorgianBorca-Tasciuc Yes, but they are different. When a read call is done, it calls the VFS ["virtual file system"] layer's read callback. Each FS driver supports this callback. The VFS probably has a `getdents` callback. So, the driver handles all the non-contig, inodes, etc to present a "flat file" that you can read/write/seek on. So, whatever block list organization an FS uses (e.g. ext4 uses inode and "indexed sequential", a DOS FS is different), the VFS handles this. getdents is an aside to this. Even after the VFS read, you don't know the format. – Craig Estey Mar 24 '16 at 01:07
  • @CraigEstey I understand that much - but couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file? Then again, I'm assuming that a FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory rather than a series of bytes. My question is more like since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading the linux_dirent? – Giorgian Borca-Tasciuc Mar 24 '16 at 01:15
  • 3
    @GiorgianBorca-Tasciuc "but couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file?" Yes, that's _exactly_ what `getdents` does. Reread that a few times. I also said as much in my 1st comment. `read` on a dir isn't intercepted and converted to `getdents` because the OS is minimalist. You do `open(2)` for files or dirs [opendir(3) is wrapper and does open(2) underneath]. You can read/write/seek for file and read/seek/getdents for dirs [_no_ write]. If you want standardized info, use `getdents`. If you want raw info, use `read`. – Craig Estey Mar 24 '16 at 01:31
  • 1
    @CraigEstey My apologies for frustrating you, but in your last comment you answered the question I wanted answered - the rationale behind getdents(); that the OS (or driver - in my scenario, the FS driver was converting the call from read() to getdents()) is minimalist. I guess I wasn't communicating clearly the question I wanted answered. Thank you, again, I'm sorry for any frustration I may have caused. If you could make that an answer I would mark it as the correct one. – Giorgian Borca-Tasciuc Mar 24 '16 at 01:38
  • @CraigEstey Also, just out of curiosity, how could I used read() on directories for raw info? Again, read on a directory sets errno to EISDIR and fails. – Giorgian Borca-Tasciuc Mar 24 '16 at 01:46
  • @GiorgianBorca-Tasciuc I'll be out for groceries but will add an answer when I get back – Craig Estey Mar 24 '16 at 01:54
  • @CraigEstey, that was a very long grocery shopping trip! – Michael Goldshteyn Aug 18 '18 at 00:03
2

Your suspicion is correct: It would make more sense to have the read system call work on directories and return some standardized data, rather than have the separate getdents system call. getdents is superfluous and reduces the uniformity of the interface. The other answers assert that "read" as an interface would be inferior in some way to "getdents". They are incorrect. As you can observe, the arguments and return value of "read" and "getdents" are identical; just "read" only works on non-directories and "getdents" only works on directories. "getdents" could easily be folded into "read" to get a single uniform syscall.

The reason this is not the case is historical. Originally, "read" worked on directories, but returned the actual raw directory entry in the filesystem's on-disk format. This was complex to parse, so the getdents call was added alongside read to provide a filesystem-independent view of directory entries. Eventually, "read" on directories was turned off. "read" on directories could just as well have been made to behave identically to getdents instead of being turned off. It just wasn't, possibly because it seemed duplicative.

In Linux, in particular, "read" has returned an error when reading directories for so long that it's almost certain that some program is relying on this behavior. So, backwards compatibility demands that "read" on Linux will never work on directories.

user38527