1

I’m new to programming and C and I'm currently working through K&R. Apologies in advance if this isn't the most succinct way of characterizing the problem.

For context, in section 8.6 of K&R (not the exercises but the actual chapter) they implement the function fsize() that prints out the size of files in a directory and its sub-directories recursively. The code in the book uses the syscall read() to implement a basic version of readdir(), which returns a pointer to the next entry in a directory.

Up until this section of K&R, all source code has worked fine on my machine, however, the code in this chapter relies on using the read() function on directories to get their contents, which according to a source I found here [1], doesn’t work on Linux and many modern systems.

However, there exists a syscall getdents() which seems to do roughly same thing [2]. So as an exercise I tried to re-implement readdir() and came across the following problems:

  • read() on directories seems to know in advance the size of each entry, so it's able to return one entry at a time and let read() handle the issue of "remembering" the location of the next entry every time it is called.
  • getdents() on the other hand doesn't know the size of each entry in advance, so I have to read the entire buffer first and then loop through it in the readdir() using the member d_reclen (I copied from the example at the bottom of man getdents), meaning now my readdir() function has to handle the issue of "remembering" the location of the next entry in the stream every time readdir() is called.

So my questions are as follows:

  1. Am I correct in my understanding that getdents() cannot be made to behave like read() in the sense that it can read one entry at a time and handle the "remembering of the next position"?
  2. If it is true that getdents() cannot behave like read(), what is the best way to implement "remembering position", in particular if getdents() need to be called multiple time on several sub-directories? I've shown an excerpt of what I tried below: using the file descriptor assigned by the system as a way of indexing the results of getdents() in an array. However this attempt seems to fail given how opendir() and closedir() are implemented — the system will reassign file descriptors once closedir() has been called and opendir() is called on the next subdirectory (and this information is not available to readdir()).

Last Note: I want my implementation of read_dir() to behave exactly like that of readdir() in K&R. Meaning I wouldn't have to change any of the other functions or structures to make it work.

// NTD: _direct's structure needs to match how system implements directory
// entries. After reading from file descriptor into _direct, we then
// copy only the relevant elements (d_ino and d_name) to Dirent
struct _direct {                // directory entry
    long d_ino;                 // inode number
    off_t d_off;                // Not included in K&R
    unsigned short d_reclen;    // Not included in K&R
    char d_name[];              // long name does not have '\0'
};
 
#define BUFSIZE 4096    // Size of buffer when reading from getdents()
#define MAXFILES 1024   // Max files that read_dir() can open
 
struct _streamdents {
    int pos;
    int nread;
    char *buf;
};
 
// read_dir: read directory entries in sequence
Dirent *read_dir(_dir *dp) 
{
    struct _direct *dirbuf;     // local directory structure
    static Dirent d;            // return: portable structure
    static struct _streamdents *readdents[MAXFILES];
 
    if (dp->fd > MAXFILES - 1) {
        printf("Error in read_dir: Cannot continue reading, too many directories\n");
        return NULL;
    }
 
    // Check if directory has already been read; if not, create stream.
    // Important if fxn is called for a sub-directory and then needs
    // to return to a parent directory and continue reading.
    if (readdents[dp->fd] == NULL) {
        char *buf = malloc(BUFSIZE);
        int nread = syscall(SYS_getdents, dp->fd, buf, BUFSIZE);
        int pos = 0; 
 
        struct _streamdents *newdent = malloc(sizeof(struct _streamdents));
        newdent->buf = buf;
        newdent->pos = pos;
        newdent->nread = nread;
        readdents[dp->fd] = newdent;
    }
 
    struct _streamdents *curdent = readdents[dp->fd];
    int pos = curdent->pos;
    int nread = curdent->nread;
    char *buf = curdent->buf;
 
    while (pos < nread) {
        dirbuf = (struct _direct *) (buf + pos);
        if (dirbuf->d_ino == 0) // slot not in use
            continue;
        d.ino = dirbuf->d_ino;
        strncpy(d.d_name, dirbuf->d_name, DIRSIZ);
        curdent->pos += dirbuf->d_reclen;
        return &d;
    }
 
    if (nread == -1) {
        printf("Error in getdents(): %s\n", strerror(errno));
    }
 
    return NULL;
}

Thank you

jimmy
  • 109
  • 2
  • 9
  • It can't read one at a time, but it does remember the position. – Barmar Aug 16 '20 at 16:43
  • There's example code at https://www.man7.org/linux/man-pages/man2/getdents.2.html – Barmar Aug 16 '20 at 16:48
  • The example was actually what I used to implement my attempt at readdir(). The main problem I found was that the example there reads the entire stream to a buffer first because it needs to use the member d_reclen to know the size of each entry. In other words, the "remembering of the position" feature in getdents() seems superfluous if you need to read the entire stream into a buffer before working with any of the data. But perhaps I'm missing something about to work with getdents(). – jimmy Aug 16 '20 at 17:05
  • It uses nested loops. The outer loop reads BUFSIZE bytes, which usually won't be the entire directory. The inner loop processes each entry within the buffer. The position is remembered between calls in the outer loop. – Barmar Aug 16 '20 at 17:18
  • I see how that would work if you only have one directory that you are reading at a time. But the implementation in K&R also calls itself recursively if it comes across a subdirectory. In that case, what would be the best way of "remembering" positions if getdents() is called multiple times on many subdirectories? My attempt above tries to index them by file descriptor in a static array, however file descriptors doesn't seem to be a reliable identifier because they get reassigned by the system once closedir() is called. – jimmy Aug 16 '20 at 17:43
  • You shouldn't close the FD until you've completed reading the directory. – Barmar Aug 16 '20 at 18:01
  • The code now doesn't close a directory until it is finished. Say the first parent directory is assigned FD=3, then the first sub-directory is assigned FD=4. Once that subdirectory is finished reading, the code in K&R closes the subdirectory. Now if it comes across another subdirectory within the parent directory, the system will assign it FD=4 (ideally it would be FD=5). I could change the way the code deals with closing directories, but that would require altering some of the K&R code outside the readdir() function (which I'd ideally avoid having to do). – jimmy Aug 16 '20 at 19:25
  • If you implement this recursively, like they do in K&R, then it becomes much simpler. But all you have to do is close the FD when `getdents()` returns `0`. – Barmar Aug 17 '20 at 03:58
  • I am trying to implement it recursively like K&R. My code above only includes the readdir() part of the function. dirwalk(), fsize(), all the other functions still looks exactly the same. Perhaps I don't know what you mean though. – jimmy Aug 17 '20 at 14:13
  • I didn't read your code carefully, didn't understand what you were doing with the `readdents` array. I assumed this was the code looping through the directory entries, now I see it's just reading a single entryy, potentially buffering entries. – Barmar Aug 17 '20 at 14:21
  • The `while (pos < nread)` loop looks like an infinite loop, you never update `pos`. – Barmar Aug 17 '20 at 14:25
  • I update it with `curdent->pos += dirbuf->d_reclen`. The loop should only run once per function call (in order to imitate read() being able to return one entry at a time). So right after you update `curdent->pos`, you `return &d` and the next time the function is called, pos is updated with `pos = curdent->pos`. – jimmy Aug 17 '20 at 14:29
  • `pos` is not the same as `curdent->pos`, it just gets its initial value before the loop from that. And the `continue` statement is before that update. So when the `if` condition is true, you'll just keep testing the same directory entry. – Barmar Aug 17 '20 at 14:32
  • The bigger design problem with the code is that you need to call `getdents()` again when you reach the end of the buffer. You only call it when `readdents[dp->fd] == NULL` – Barmar Aug 17 '20 at 14:33
  • Right, I guess I should have used an `if` statement instead of `while`, because it effectively behaves more like an `if` statement. The `return &d` statement (the last line in the loop and immediately after `curdent->post += dirbuf->d_reclen`), ensures that the `while (pos < nread)` loop only runs once. Next time the function is called, `pos` is updated again. – jimmy Aug 17 '20 at 14:35
  • In that case, it seems like you're confusing `continue` with `break`. – Barmar Aug 17 '20 at 14:39
  • But doesn't need need to increment `pos` and repeat until it gets to an entry where `d_ino != 0`? – Barmar Aug 17 '20 at 14:39
  • Can you explain why I should call getdents() again when I reach the end of the buffer? Here perhaps I'll link the entire program here: https://pastebin.com/uhpBcyWc. The only modification I've made in the linked code is that I comment out close_dir() so the function works properly if there is more than one subdirectory. – jimmy Aug 17 '20 at 14:41
  • Unless `buf` is big enough the hold the entire directory, you'll have to call `getdents()` multiple times to get everything. – Barmar Aug 17 '20 at 14:44
  • Perhaps if the directory was very large and contained many entries, but the `#define BUFSIZE 4096` size hasn't been a constraint in my tests so far. And if it were to become an issue I could just increase the value. – jimmy Aug 17 '20 at 14:51
  • why use a 'bear bones' kernel call? Just go ahead and use `readdir()` – user3629249 Aug 18 '20 at 02:18
  • Trying to re-implement readdir() mostly as an exercise. Want to learn more about how kernel calls work under the hood. – jimmy Aug 18 '20 at 16:39

0 Answers0