
Story

I was diagnosing a bug in an app written in C on Linux. It turned out that the bug was caused by a missing fclose in the child process while the FILE * handle was still open in the parent process.

The file is only read. There are no write operations.

Case 1

The app is running on Linux 5.4.0-58-generic. In this case the bug occurred.

Case 2

The app is running on Linux 5.10.0-051000-generic. In this case there is no bug, and this is what I expected.

What is the bug?

The parent process performs a random number of fork syscalls if there is no fclose in the child process.

Case 2 affirmation

I am fully aware that forgetting fclose leads to a memory leak, but:

  • I think that, just in this case, it is not strictly necessary, because the child process exits almost immediately, and the exit I use is exit(3), not _exit(2).
  • The strange thing is: how can forgetting fclose in the child process affect the parent process?

My current guess:

This is a Linux kernel bug that was fixed in some version after 5.4. I don't have proof yet, but my tests suggest so.


Question

I have been able to fix this app bug by calling fclose in the child process before it exits. But I want to know what actually happens in this case. So my question is: how can forgetting fclose in the child process affect the parent process?


Very simple code to reproduce the problem (3 files attached).

Note: The only difference between test1.c and test2.c is the fclose in the child process. test2.c does not call fclose in the child process.

File test.txt

123123123
123123123
123123123
123123123
123123123
123123123

File test1.c

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/wait.h>
#define TICK do { putchar('.'); fflush(stdout); } while(0)
int main() {
  char buff[1024] = {0};
  FILE *handle = fopen("test.txt", "r");

  uint32_t num_of_forks = 0;

  while (fgets(buff, 1024, handle) != NULL) {

    TICK;
    num_of_forks++;

    pid_t pid = fork();
    if (pid == -1) {
      printf("Fork error: %s\n", strerror(errno));
      continue;
    }

    if (pid == 0) {
      fclose(handle);
      exit(0);
    }
  }

  fclose(handle);
  putchar('\n');
  printf("Number of forks: %d\n", num_of_forks);
  wait(NULL);
}

File test2.c

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/wait.h>
#define TICK do { putchar('.'); fflush(stdout); } while(0)
int main() {
  char buff[1024] = {0};
  FILE *handle = fopen("test.txt", "r");

  uint32_t num_of_forks = 0;

  while (fgets(buff, 1024, handle) != NULL) {

    TICK;
    num_of_forks++;

    pid_t pid = fork();
    if (pid == -1) {
      printf("Fork error: %s\n", strerror(errno));
      continue;
    }

    if (pid == 0) {
      // fclose(handle);
      exit(0);
    }
  }

  fclose(handle);
  putchar('\n');
  printf("Number of forks: %d\n", num_of_forks);
  wait(NULL);
}


Run the program


Run on Linux 5.4.0-58-generic (where the bug happens)

Look at the test2 execution (the bug): it leads to a random number of fork syscalls.

ammarfaizi2@integral:/tmp$ uname -r
5.4.0-58-generic
ammarfaizi2@integral:/tmp$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ammarfaizi2@integral:/tmp$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.1) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
ammarfaizi2@integral:/tmp$ cat test.txt
123123123
123123123
123123123
123123123
123123123
123123123
ammarfaizi2@integral:/tmp$ diff test1.c test2.c
27c27
<       fclose(handle);
---
>       // fclose(handle);
ammarfaizi2@integral:/tmp$ gcc test1.c -o test1 && gcc test2.c -o test2
ammarfaizi2@integral:/tmp$ ./test1
......
Number of forks: 6
ammarfaizi2@integral:/tmp$ ./test1
......
Number of forks: 6
ammarfaizi2@integral:/tmp$ ./test1
......
Number of forks: 6
ammarfaizi2@integral:/tmp$ ./test2
..................................................................................................................................................................................
Number of forks: 178
ammarfaizi2@integral:/tmp$ ./test2
............................................................................................................................................................................................................................................................................................................................................................
Number of forks: 348
ammarfaizi2@integral:/tmp$ ./test2
...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Number of forks: 475
ammarfaizi2@integral:/tmp$ md5sum test1 test2
c32d03916b9b72546b966223837fd115  test1
f314d2135092362288a66f53b37ffa4d  test2

Run on Linux 5.10.0-051000-generic (the same code, no bug at all)

root@esteh:/tmp# uname -r
5.10.0-051000-generic
root@esteh:/tmp# gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@esteh:/tmp# ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.1) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@esteh:/tmp# cat test.txt
123123123
123123123
123123123
123123123
123123123
123123123
root@esteh:/tmp# diff test1.c test2.c
27c27
<       fclose(handle);
---
>       // fclose(handle);
root@esteh:/tmp# gcc test1.c -o test1 && gcc test2.c -o test2
root@esteh:/tmp# ./test1
......
Number of forks: 6
root@esteh:/tmp# ./test1
......
Number of forks: 6
root@esteh:/tmp# ./test1
......
Number of forks: 6
root@esteh:/tmp# ./test2
......
Number of forks: 6
root@esteh:/tmp# ./test2
......
Number of forks: 6
root@esteh:/tmp# ./test2
......
Number of forks: 6
root@esteh:/tmp# md5sum test1 test2 # Make sure the files are identical with case 1
c32d03916b9b72546b966223837fd115  test1
f314d2135092362288a66f53b37ffa4d  test2

Summary

  • Forgetting fclose in the child process on Linux 5.4.0-58-generic causes the parent process to perform an unexpected, random number of fork syscalls.
  • The bug does not seem to exist on Linux 5.10.0-051000-generic.
Ammar Faizi
  • *File descriptor in child process should be independent against file descriptor in parent process* That's wrong. FDs across forks are different handles to the same FD after all, just like what you'd get with `dup(2)` or `dup2(2)`. Only closing (and duplicating) is independent, and when you read/write/`lseek(2)` from one FD, the other FD is also affected. – iBug Dec 26 '20 at 13:23
  • @iBug thanks for the correction. I want to clarify something to make sure I understand your comment in higher level (view from utils in `man 3 xxxx`). -- So, if I open a file with `fopen(3)`, **then** I call `fork(2)`, **then** I change the offset of `file handle which is created by the parent process` with `fseek(3)` from the child, **then** the offset of file handle in parent will change according to the `fseek(3)` call I do in child process. Is that right? – Ammar Faizi Dec 26 '20 at 13:38
  • Yes, exactly. But keep in mind that process scheduling may make the result non-predictable, unless you implement some kind of "syncing". – iBug Dec 26 '20 at 13:39
  • Provide the glibc run-time version on each system. – Hadi Brais Dec 27 '20 at 03:14
  • @HadiBrais ok, I have just provided the glibc version on each system (post has been edited too). They both are identical version `ldd (Ubuntu GLIBC 2.31-0ubuntu9.1) 2.31`. I checked it with `ldd --version` command. – Ammar Faizi Dec 27 '20 at 04:03
  • I think your problem may be closely related to, if not the same as, the problem analyzed in [Why does forking my process cause the file to be read infinitely?](https://stackoverflow.com/a/50112169/15168) – Jonathan Leffler Dec 27 '20 at 04:10
  • Thanks to @JonathanLeffler, I have been able to discover the real problem from your answer in that thread. – Ammar Faizi Dec 27 '20 at 07:58
  • Thanks to @iBug too about the scheduling insight. – Ammar Faizi Dec 27 '20 at 07:58
  • I suspected that you have a different version of glibc on the Linux 5.10 system in which the IO cleanup code is different. That's why I asked for the versions. But now we know that the "strange" behavior occurs on both systems, so it doesn't matter anymore. BTW, `ldd --version` gives you the linker version, not the glibc run-time version, which can be obtained by calling `gnu_get_libc_version()`. – Hadi Brais Dec 27 '20 at 23:21

1 Answer


Thanks to @Jonathan Leffler!

This problem is a duplicate of "Why does forking my process cause the file to be read infinitely?"

The missing piece, namely why the bug does not occur on Linux 5.10.0-051000-generic, turned out to be unrelated to the kernel.


It turned out that the parent process races with the child processes (this is not related to the kernel).

  • Note: changing the offset of the file handle from a child process also changes the offset in the parent process if the handle was created by the parent before the fork.
  • If there is no fclose(3) in the children, each child calls lseek(2) as soon as it calls exit(3). This makes the parent re-read from an earlier offset, because the children call lseek(2) with a negative offset and SEEK_CUR.

(I don't know why it is necessary to call lseek(2) before exit. It may be explained in @Jonathan Leffler's answer, which I did not read in full.)

  • If the parent finishes reading the entire file before the children call lseek(2), then there is no problem at all.

Also, as @iBug mentioned: "But keep in mind that process scheduling may make the result non-predictable, unless you implement some kind of "syncing"."

The parent process on the Linux 5.10.0-051000-generic machine I used was simply lucky: it always finished reading the entire file before the children called lseek(2).

I then added more lines to the file (150 lines in total), so the parent takes longer than it does with 6 lines, and the undefined behavior happens on that machine too.

Test result: https://gist.githubusercontent.com/ammarfaizi2/b72bd03fcc13779f96b8bbeef9253e66/raw/da1eff4ed5434aa51929e5c810d54de8ffe15548/test2_fix.txt

Ammar Faizi
  • Why lseek happens: it's an unfortunate part of `fflush` that's needed in the general case of fflush, but probably *not* when exiting. (Unless sticking to the letter of some spec mandates that, e.g. if `exit(3)` specifies `fflush(3)` by name.) And yes, I found this at the bottom of Jonathan's excellent answer. The workaround is to `fflush(stdin)` *before* forking, so exit finds that stdin buffer is empty (and thus the Unix fd position is in sync with the stdio read position, so lseek isn't needed). – Peter Cordes Dec 27 '20 at 08:08