
To print the contents of a file, one can use getc:

int ch;
FILE *file = fopen("file.txt", "r");
if (file == NULL)
    return 1;                      /* handle a missing file */
while ((ch = getc(file)) != EOF) {
    putchar(ch);                   /* print each character */
}
fclose(file);

How efficient is the getc function? That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time? For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?

David542
  • The intent behind standard I/O buffering is to make character-at-a-time I/O feasible. Unless you take steps to make it do otherwise (by default, in other words), the I/O library will read fairly large chunks of data (1 KiB on macOS, for example) and then dole out the characters one at a time quite efficiently. On modern multithreaded systems, there is some locking and unlocking overhead (nominally, at least) with each use of `getc()` — see [`lockfile()`, `unlockfile()`, `trylockfile()`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/lockfile.html). _[…continued…]_ – Jonathan Leffler Feb 16 '21 at 06:03
  • _[…continuation…]_ If your code is strictly single-threaded, you can use [`getc_unlocked()` et al](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getc_unlocked.html) to avoid that locking overhead. If your code can handle bigger chunks (using `fread()` etc.), reading in chunks from 8 KiB up to maybe 64 KiB may give some performance benefit; your code will handle the access to each character. Bigger buffer sizes usually give negligible benefit, but you can measure to see what works best for you. – Jonathan Leffler Feb 16 '21 at 06:05
  • `getc()`, `fgetc()` have a read-buffer filled by the I/O system. On Linux the size is `8192` bytes; on Windows it is `512` bytes. So it is basically a wash whether you use `fgets()` with a buffer of 8K or less, or you use `getc()`. There will be no measurable difference. The macro used to set the low-level buffer size has changed names a couple of times, the most recent being `BUFSIZ`. The commit making the change is [glibc/libio/stdio.h - #define BUFSIZ 8192](https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/stdio.h;h=b63ee88a776aa12586f086ab49d05baadd12aeed;hb=refs/heads/master#l99) – David C. Rankin Feb 16 '21 at 06:07
  • @David542: what kind of applications are you developing? Do you need to write some code faster than [wc(1)](https://man7.org/linux/man-pages/man1/wc.1.html) or [cat(1)](https://man7.org/linux/man-pages/man1/cat.1.html) operating on the same input file? – Basile Starynkevitch Feb 16 '21 at 06:31
  • Without real-world context **your question is actually a matter of opinion**, since we don't know what kind of application you are coding, what kind of input data (HTML5, JSON, numerical data in ASCII format for weather forecasting ...), for what kind of hardware ([RaspBerryPi](https://raspberrypi.org/) or [Top500](https://top500.org/) super computers?) etc.... – Basile Starynkevitch Feb 16 '21 at 06:57
  • @JonathanLeffler all your links in the first comment 404 for me when I try and click them. – David542 Feb 16 '21 at 07:35
  • @David542 — memory lapse late at night and a failure to check after adding the comment. Sorry (and thanks for pointing out my error). The function names actually begin with `f`, so the correct link (one page for three functions) is [`flockfile()`, `funlockfile()`, `ftrylockfile()`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html). (And yes, this time I checked the URL after hitting 'submit'.) – Jonathan Leffler Feb 16 '21 at 15:05

1 Answer


That is, how frequently does it actually do operating system calls or something that would take a non-trivial amount of time?

You could look into the source code of GNU libc or of musl-libc to study the implementation of getc. You should also study the implementations of cat(1) and wc(1); both are open source. And GNU as (part of GNU binutils) is free software (used internally by most GCC compilations) which in practice runs very quickly while doing heavy textual manipulation (transforming textual assembler input into binary object files). You could take inspiration from its source code.

You could change the stdio buffer size with setvbuf(3).
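For example, a minimal sketch (the file name comes from the question; the 64 KiB size is an arbitrary illustration, not a recommendation):

#include <stdio.h>

int main(void) {
    FILE *file = fopen("file.txt", "r");
    if (file == NULL)
        return 1;
    /* Request a 64 KiB fully buffered stream buffer; setvbuf(3)
       must be called before any other operation on the stream. */
    if (setvbuf(file, NULL, _IOFBF, 64 * 1024) != 0)
        perror("setvbuf");
    int ch;
    while ((ch = getc(file)) != EOF)
        ;                          /* same getc loop as in the question */
    fclose(file);
    return 0;
}

With a buffer of that size, the underlying read(2) happens only about once per 64 KiB rather than once per character.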

You may want to read several bytes at once using fread(3) or fgets(3), probably in chunks of several kilobytes.
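A sketch of the fread(3) variant, again assuming an illustrative 64 KiB chunk (measure to tune the size):

#include <stdio.h>

int main(void) {
    static char buf[64 * 1024];    /* example chunk size */
    FILE *file = fopen("file.txt", "r");
    if (file == NULL)
        return 1;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, file)) > 0) {
        /* process the n bytes in buf[0] .. buf[n-1] here */
    }
    fclose(file);
    return 0;
}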

You can also use the debugger gdb(1) or the strace(1) utility to find out when syscalls(2) are made and which ones (for example, strace -c ./yourprogram prints a per-syscall count summary).

For example, let's say I had a 10TB file -- would calling this function trillions of times be a poor way to get through data?

Very probably not: stdio reads the file in buffered chunks, so only a tiny fraction of the getc calls reach the kernel, and the kernel's page cache makes those read(2) syscalls cheap.

You should profile and benchmark your program to find out its bottleneck.

Most of the time it won't be getc. See time(7) and gprof(1) (and compile all your code with GCC invoked as gcc -O3 -pg -Wall).

If raw input performance is critical in your program, consider also using open(2), read(2), mmap(2), madvise(2), readahead(2), posix_fadvise(2), and close(2) directly and wisely. Most of these syscalls can fail; see errno(3).
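A rough sketch only (Linux/POSIX specific; posix_fadvise(2) is merely a hint that the kernel may ignore, and real code needs fuller error handling):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("file.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Tell the kernel we will read sequentially, so it can read ahead. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    char buf[65536];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* process the n bytes in buf here */
    }
    if (n < 0)
        perror("read");
    close(fd);
    return 0;
}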

You may also change your file system (e.g. from ext4 to XFS; see ext4(5) and xfs(5)), buy faster SSDs or more physical RAM, or play with mount(2) options, to improve performance.

See also the /proc pseudo-filesystem (see proc(5)); and this answer.
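For instance, on Linux a process can read its own I/O counters, including syscr (the number of read syscalls it has made), from /proc/self/io; a tiny sketch:

#include <stdio.h>

int main(void) {
    /* /proc/self/io is Linux-specific; see proc(5) for its fields. */
    FILE *io = fopen("/proc/self/io", "r");
    if (io == NULL)
        return 1;
    int ch;
    while ((ch = getc(io)) != EOF)
        putchar(ch);               /* dump the counters to stdout */
    fclose(io);
    return 0;
}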

You may want to use databases like sqlite or PostgreSQL.

Your program could generate C code at runtime (like manydl.c does), try various approaches (compiling the generated C code /tmp/generated-c-1234.c as a plugin using gcc -O3 -fPIC /tmp/generated-c-1234.c -shared -o /tmp/generated-plugin-1234.so, then dlopen(3)-ing and dlsym(3)-ing that generated plugin), and use machine learning techniques to find a very good one (specific to the current hardware and computer). It could also generate machine code more directly using asmjit or libgccjit, try several approaches, and choose the best one for the particular situation. Pitrat's book Artificial Beings and his blog (still here) explain this approach in more detail. The conceptual framework is called partial evaluation.
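To illustrate just the plugin-loading step, a hedged sketch (the plugin path matches the gcc command above, but the function name parse_chunk and its signature are invented for this example):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Load the freshly compiled plugin. */
    void *handle = dlopen("/tmp/generated-plugin-1234.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Look up a function the generated plugin is assumed to define. */
    long (*parse_chunk)(const char *, long) =
        (long (*)(const char *, long)) dlsym(handle, "parse_chunk");
    if (parse_chunk == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }
    long score = parse_chunk("sample input", 12);  /* benchmark the candidate */
    printf("score: %ld\n", score);
    dlclose(handle);
    return 0;
}

Link the loader with -ldl on glibc-based systems.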

You could also use existing parser generators like GNU bison or ANTLR. They generate C code.

Ian Taylor's libbacktrace could also be useful in such a dynamic metaprogramming approach (generating various forms of C code and choosing the best ones according to the call stack inspected with dladdr(3)).

Very probably yours is a parsing problem, so read the first half of the Dragon book.

Before attempting any experimentation, discuss with your manager/boss/client whether it is worth spending months of full-time work to gain a few percent in performance. Take into account that the same gain might be obtained by simply upgrading the hardware.

If your terabyte input textual file does not change often (e.g. it is delivered weekly, as happens with some bioinformatics software), it may be worthwhile to preprocess it in batch mode into a binary file, an sqlite database, a GDBM indexed file, or a REDIS store. Documenting the format of that binary file or database (using EBNF notation, taking inspiration from elf(5)) is then very important.
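As a sketch of that batch conversion (the file names and record layout are invented for illustration; a real format should also fix endianness and carry a version number):

#include <stdio.h>

/* Hypothetical fixed-size record for one number from the text file. */
struct record {
    long   line_no;
    double value;
};

int main(void) {
    FILE *in  = fopen("huge.txt", "r");    /* hypothetical text input */
    FILE *out = fopen("huge.bin", "wb");   /* binary output */
    if (in == NULL || out == NULL)
        return 1;
    struct record r = {0, 0.0};
    while (fscanf(in, "%lf", &r.value) == 1) {
        r.line_no++;
        fwrite(&r, sizeof r, 1, out);      /* one packed record per number */
    }
    fclose(out);
    fclose(in);
    return 0;
}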

Basile Starynkevitch