Optimize read + realloc using fstat in C

Question

I want to mmap stdin. But I can't mmap stdin.

So I have to call read and realloc in a loop, and I want to optimize it by choosing a good buffer size.

Calling fstat on file descriptor 0 gives a struct stat with a member named st_size, which seems to represent the number of bytes that are already in the pipe buffer corresponding to stdin.

Depending on how I call my program, st_size varies between 0 and the full pipe buffer size of stdin, which is about 65520. For example, with this program:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    struct stat buf;
    int i;
    for (i=0; i<20; i++)
    {
        fstat(0, &buf);
        /* st_size is an off_t, so cast it to match the %lld format */
        printf("%lld\n", (long long) buf.st_size);
        usleep(10);
    }
}

We can observe the buffer is still being filled:

$ cc fstat.c
$ for (( i=0; i<10000; i++ )) do echo "0123456789" ; done | ./a.out
9196
9306
9350
9394
9427
9471
9515
9559
...

And the output of this program changes every time I re-run it.

I'd like to use st_size for the initial buffer size so that I make fewer calls to realloc.

I have three questions, most important one first:

Can I 'flush' stdin, i.e. wait until the buffer is no longer being filled?
Could I do better? Maybe there is another way to optimize this, so that calling realloc in a loop feels less dirty.
Can you confirm that I can't use mmap with stdin?

A few details:

I wanted to use mmap because I already wrote a program, for educational purposes, that does what nm does, using mmap. I'm wondering how nm handles stdin.
I know that st_size can be 0, in which case I will have no choice but to define a buffer size.
I know stdin can give more bytes than the pipe buffer size.
I want to optimize for educational purposes. There is no pressing need for it.
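
To make the idea concrete, here is a minimal sketch of choosing the initial buffer size from st_size with a fixed fallback. The helper name `initial_capacity` and the 4096-byte `FALLBACK_CAPACITY` are assumptions made for illustration only. Note that POSIX only defines st_size for regular files (and a few other object types), so for a pipe the value is at best a hint.

```c
#include <stddef.h>
#include <sys/stat.h>

/* Assumed default, used when st_size gives no useful information. */
#define FALLBACK_CAPACITY 4096

/* Hypothetical helper: pick a starting buffer size for reading fd.
 * For a regular file st_size is the file size; for a pipe the value
 * is unspecified by POSIX and may be 0 or the bytes currently queued. */
static size_t initial_capacity(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0 || st.st_size <= 0)
        return FALLBACK_CAPACITY;
    return (size_t) st.st_size;
}
```

The result would only be a starting point: more data may still arrive after the fstat call, so the read loop still has to be able to grow the buffer.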

Thank you

Devil's advocate question: Why not just use a 64KiB buffer size? That, along with limiting reads to 64KiB, would probably be faster than either reallocing or faffing about with fstat to pre-determine buffer size. We'll need more info about the end goal to really know though. — kdmurrray91, Jan 04 '17 at 23:33
Note: `buf.st_size` is not specified to be `long long` but `off_t` which is some _signed_ integer type. Suggest `printf("%lld\n", (long long) buf.st_size);` — chux - Reinstate Monica, Jan 04 '17 at 23:36
`mmap()` should work just fine for stdin, assuming that stdin is coming from a disk file. You can't mmap() a terminal or a pipe, of course. — Robᵩ, Jan 04 '17 at 23:40
@Robᵩ Well the asker clearly wants to emulate `mmap` on a terminal or a pipe, by reading as much as possible into memory... — user253751, Jan 05 '17 at 00:09
If you're targeting a 64-bit platform, you could use mmap to allocate a really big array (like 1GB) and only commit pages as needed, see [this question](http://stackoverflow.com/questions/2782628/any-way-to-reserve-but-not-commit-memory-in-linux) — user253751, Jan 05 '17 at 00:11
There's no way to get the size of a terminal or pipe, because the data is arriving asynchronously. You can't know the full size of a pipe until the writer closes it. And if you don't read periodically, you may deadlock because there's limited pipe buffering in the kernel. — Barmar, Jan 05 '17 at 00:12
Keep tabs on how much memory you've allocated and how much you've used. Start with, say, 4 KiB allocated; when that's used up, double the amount you allocate; rinse and repeat. If there's enough over-allocated at the end (say, more than 64 KiB), consider a shrinking `realloc()` to release the superfluous space. The pipe buffer size will be fairly modest (between 4 KiB and 64 KiB) usually; you will never read more than {PIPE_SIZE} bytes in a single `read()` call. However, there could be more data than that to read in total. — Jonathan Leffler, Jan 05 '17 at 01:28
But apparently the buffer size for each pipe can be adjusted [How big is the pipe buffer?](http://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer) `:)` — David C. Rankin, Jan 05 '17 at 03:40
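
The doubling strategy suggested in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `read_all`, the 4 KiB starting size, and the 64 KiB shrink threshold are taken from or invented around the comment, not from the asker's program.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read everything from fd until EOF.
 * Returns a malloc'd buffer (caller frees) and stores the byte
 * count in *len, or returns NULL on error. */
static char *read_all(int fd, size_t *len)
{
    size_t cap = 4096;              /* assumed starting size */
    size_t used = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    for (;;) {
        if (used == cap) {
            /* Buffer is full: double the allocation. */
            char *tmp;
            if (cap > SIZE_MAX / 2) {
                free(buf);
                return NULL;
            }
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }

        ssize_t n = read(fd, buf + used, cap - used);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;                  /* EOF: the writer closed its end */
        used += (size_t) n;
    }

    /* Release the slack if more than 64 KiB is unused. */
    if (cap - used > 65536) {
        char *tmp = realloc(buf, used ? used : 1);
        if (tmp != NULL)
            buf = tmp;
    }

    *len = used;
    return buf;
}
```

With doubling, the number of realloc calls grows only logarithmically with the input size, so the choice of starting size matters much less than it would with a fixed increment.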
