After digging through the source a bit more and trying to better understand how setvbuf and fread work, I think I understand how buffering and READAHEAD_BUFSIZE relate to each other: when iterating through a file, a buffer of READAHEAD_BUFSIZE bytes is filled on each line, but filling this buffer uses calls to fread, each of which fills a buffer of buffering bytes.
Python's read is implemented as file_read, which calls Py_UniversalNewlineFread, passing it the number of bytes to read as n. Py_UniversalNewlineFread then eventually calls fread to read n bytes.
When you iterate over a file, the function readahead_get_line_skip is what retrieves a line. This function also calls Py_UniversalNewlineFread, passing n = READAHEAD_BUFSIZE. So this eventually becomes a call to fread for READAHEAD_BUFSIZE bytes.
So now the question is: how many bytes does fread actually read from disk? If I run the following code in C, then 1024 bytes get copied into buf and 512 into buf2. (This might be obvious, but never having used setvbuf before, it was a useful experiment for me.)
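FILE *f = fopen("test.txt", "r");
void *buf = malloc(1024);       /* buffer that stdio will fill from disk */
void *buf2 = malloc(512);       /* destination for the fread call */
setvbuf(f, buf, _IOFBF, 1024);  /* fully buffered, 1024 bytes at a time */
fread(buf2, 512, 1, f);         /* ask for one 512-byte item */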
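For anyone who wants to repeat the experiment, here is a rough sketch of one way to check how far the underlying file descriptor has advanced after an fread. The helper name fd_offset_after_fread is made up for illustration, and it assumes a POSIX system (it uses fileno and lseek, which are not part of standard C). It only reports what the C library did on the machine it runs on; other C libraries may buffer differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from CPython): open `path`, install a fully
 * buffered stdio buffer of `bufsize` bytes, fread `nread` bytes, and return
 * the file descriptor's offset afterwards. Returns -1 on any failure,
 * including a file shorter than `nread` bytes. */
long fd_offset_after_fread(const char *path, size_t bufsize, size_t nread)
{
    assert(bufsize > 0 && nread > 0);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *stdio_buf = malloc(bufsize);
    char *dest = malloc(nread);
    long offset = -1;

    if (stdio_buf != NULL && dest != NULL &&
        setvbuf(f, stdio_buf, _IOFBF, bufsize) == 0 &&
        fread(dest, 1, nread, f) == nread) {
        /* lseek on the descriptor bypasses stdio, so this shows how far the
         * underlying reads have advanced, not how much fread handed back. */
        offset = (long)lseek(fileno(f), 0, SEEK_CUR);
    }

    fclose(f);          /* close before freeing the buffer stdio is using */
    free(stdio_buf);
    free(dest);
    return offset;
}
```

Calling fd_offset_after_fread("test.txt", 1024, 512) on a file of at least 1024 bytes should return 1024 if the observation above holds, that is, if the library fills its whole buffer even though only 512 bytes were requested. The only thing I would rely on in general is that the result is at least 512.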
So, finally, this suggests to me that when iterating over a file, at least READAHEAD_BUFSIZE bytes are read from disk, but it might be more. I think that the first iteration of for line in f will read x bytes, where x is the smallest multiple of buffering that is at least as large as READAHEAD_BUFSIZE.
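To make that guess concrete, here is the arithmetic as a small C function. The name first_read_estimate is made up, and the function only encodes my hypothesis (round the readahead size up to a whole number of buffering-sized chunks); it does not describe what CPython or the C library is guaranteed to do. The value 8192 used in the examples is what I believe READAHEAD_BUFSIZE is defined as, so treat it as an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `buffering` that is not below
 * `readahead`. Returns 0 if `buffering` is 0, since no multiple exists. */
size_t first_read_estimate(size_t readahead, size_t buffering)
{
    if (buffering == 0)
        return 0;
    /* Integer ceiling division, then scale back up to bytes. */
    return ((readahead + buffering - 1) / buffering) * buffering;
}
```

With a readahead size of 8192, a buffering of 4096 gives 8192 (two full chunks), a buffering of 3000 gives 9000 (three chunks), and a buffering of 16384 gives 16384 (one chunk larger than the readahead itself).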
If anyone can confirm that this is what's actually going on, that would be great!