
I am reading files of different sizes (1 KB - 1 GB) using read() in C. But every time I check the page faults using perf stat, it gives me (almost) the same values.

My machine: Fedora 18 on a virtual machine, RAM: 1 GB, disk space: 20 GB

uname -a
Linux localhost.localdomain 3.10.13-101.fc18.x86_64 #1 SMP Fri Sep 27 20:22:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

mount | grep "^/dev"
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)

My code:

 10 #define BLOCK_SIZE 1024
. . . 
 19         char text[BLOCK_SIZE];
 21         int total_bytes_read=0;
. . .

 81         while((bytes_read=read(d_ifp,text,BLOCK_SIZE))>0)
 82         {
 83                 write(d_ofp, text, bytes_read); // writing to /dev/null
 84                 total_bytes_read+=bytes_read;
 85                 sum+=(int)text[0];  // doing this just to make sure there's 
                                             // no lazy page loading by read()
                                             // I don't care what is in `text[0]`
 86         }
 87         printf("total bytes read=%d\n", total_bytes_read);
 88         if(sum>0)
 89                 printf("\n");

perf stat output (shows the file size, the time to read the file, and the number of page faults):

[read]:   f_size:   1 KB, Time:  0.000313 seconds, Page-faults: 150, Total bytes read: 980
[read]:   f_size:  10 KB, Time:  0.000434 seconds, Page-faults: 151, Total bytes read: 11172
[read]:   f_size: 100 KB, Time:  0.000442 seconds, Page-faults: 150, Total bytes read: 103992
[read]:   f_size:   1 MB, Time:  0.00191  seconds, Page-faults: 151, Total bytes read: 1040256
[read]:   f_size:  10 MB, Time:  0.050214 seconds, Page-faults: 151, Total bytes read: 10402840
[read]:   f_size: 100 MB, Time:  0.2382   seconds, Page-faults: 150, Total bytes read: 104028372
[read]:   f_size:   1 GB, Time:  5.7085   seconds, Page-faults: 148, Total bytes read: 1144312092

Questions:
1. How can the number of page faults for a read() of a 1 KB file and a 1 GB file be the same? Since I am also touching the data (code line #85), I am making sure the data is actually being read.
2. The only reason I can think of for it not encountering that many page faults is that the data is already present in main memory. If this is the case, how can I flush it so that when I run my code it shows me the true page faults? Otherwise I can never measure the true performance of read().

Edit 1:
echo 3 > /proc/sys/vm/drop_caches doesn't help; the output remains the same.

Edit 2: For mmap, the perf stat output is:

[mmap]:   f_size:   1 KB, Time:  0.000103 seconds, Page-faults: 14
[mmap]:   f_size:  10 KB, Time:  0.001143 seconds, Page-faults: 151
[mmap]:   f_size: 100 KB, Time:  0.002367 seconds, Page-faults: 174
[mmap]:   f_size:   1 MB, Time:  0.007634 seconds, Page-faults: 401
[mmap]:   f_size:  10 MB, Time:  0.06812  seconds, Page-faults: 2,688
[mmap]:   f_size: 100 MB, Time:  0.60386  seconds, Page-faults: 25,545
[mmap]:   f_size:   1 GB, Time:  4.9869   seconds, Page-faults: 279,519
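The mmap variant of the test is not shown above; a minimal sketch of what such a test could look like (the helper name and the one-touch-per-page access pattern are assumptions of this sketch, not the original code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and touch one byte per page -- the access pattern that
 * makes the CPU take real pagefault traps, unlike a read() loop.
 * Returns the number of bytes mapped, or -1 on error. */
long long touch_mapped_file(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return -1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    long page = sysconf(_SC_PAGESIZE);
    volatile long sum = 0;            /* keep the loads from being optimized out */
    for (off_t off = 0; off < st.st_size; off += page)
        sum += p[off];                /* each non-present page traps into the kernel */
    (void)sum;

    munmap(p, (size_t)st.st_size);
    close(fd);
    return (long long)st.st_size;
}
```

Running something like this under perf stat on files of increasing size is what makes the page fault count scale with the file size, as in the table above.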
  • Not sure if it is going to help you. To free the pagecache: echo 1 > /proc/sys/vm/drop_caches. To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches. To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches. – Sasi V Apr 26 '14 at 23:38
  • @wildplasser: What difference does it make? I should mention, I am not using that sum; I don't care what it calculates. – brokenfoot Apr 26 '14 at 23:41
  • These page faults are most probably the result of loading your code, i.e. mmap(2)-ing the executable itself. – Nikolai Fetissov Apr 26 '14 at 23:52
  • @Sasi: Thanks, but it doesn't make any difference in the output. – brokenfoot Apr 26 '14 at 23:52
  • @NikolaiNFetissov: Can you please elaborate on that? – brokenfoot Apr 26 '14 at 23:54
  • brokenfoot, can you also check the sum of all `bytes_read`? What are your kernel and your filesystem? I think you are not getting a lot of pagefaults because the blocking "sys_read" is capable of page management without faults. It just allocates a page right before writing to it. A pagefault happens only when you access an unmapped page; it is a trap. – osgx Apr 26 '14 at 23:57
  • The number of pagefaults is more or less constant. The total execution time differs because of L1 + L2 memory cache barriers being hit. – wildplasser Apr 27 '14 at 00:28

1 Answer


I think you did not understand what exactly a pagefault is. A pagefault, according to Wikipedia, is a "trap" (exception), a kind of interrupt, generated by the CPU itself when a program tries to access something that is not loaded into physical memory (but is usually already registered in virtual memory, with its page marked as "not present": Present bit = 0).

A pagefault is bad because it forces the CPU to stop executing the user program and switch to the kernel. Pagefaults in kernel mode are not so frequent, because the kernel can check for page presence before accessing a page. If a kernel function wants to write something to a new page (in your case, in the read syscall), it allocates the page by calling the page allocator explicitly, not by touching it and taking a pagefault. With explicit memory management there are fewer interrupts and less code to execute.

--- read case ---

Your read is handled by sys_read from fs/read_write.c. Here is the call chain (possibly not exact):

472 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
479                 ret = vfs_read(f.file, buf, count, &pos);
  vvv
353 ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
368                         ret = file->f_op->read(file, buf, count, pos);
  vvv

fs/ext4/file.c

626 const struct file_operations ext4_file_operations = {
628         .read           = do_sync_read,

... do_sync_read -> generic_file_aio_read -> do_generic_file_read

mm/filemap.c

1100 static void do_generic_file_read(struct file *filp, loff_t *ppos,
1119         for (;;) {
1120                 struct page *page;
1127                 page = find_get_page(mapping, index);
1128                 if (!page) {
1134                                 goto no_cached_page;  
  // osgx - case when pagecache is empty  ^^vv
1287 no_cached_page:
1288                 /*
1289                  * Ok, it wasn't cached, so we need to create a new
1290                  * page..
1291                  */
1292                 page = page_cache_alloc_cold(mapping);

include/linux/pagemap.h

233 static inline struct page *page_cache_alloc_cold(struct address_space *x)
235         return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
  vvv
222 static inline struct page *__page_cache_alloc(gfp_t gfp)
224         return alloc_pages(gfp, 0);

So I can trace that the read() syscall ends in page allocation (alloc_pages) via direct calls. After allocating the page, the read() syscall does a DMA transfer of the data from the HDD into the new page and then returns to the user (considering the case when the file is not in the pagecache). If the data was already in the pagecache, read() (do_generic_file_read) reuses the existing page from the pagecache, without an actual HDD read, and simply copies the data out of it.

After read() returns, all the data is in memory, and read access to it will not generate a pagefault.
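This is also visible from userspace with getrusage(2), which reports the process's minor fault count in ru_minflt. The sketch below is my own illustration (the helper name is made up); it wraps a read() loop in two getrusage calls, and the fault delta stays near zero no matter how large the file is:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Return the number of minor page faults this process takes while
 * read()ing a whole file, or -1 on error. Because sys_read allocates
 * pagecache pages explicitly and copies into an already-mapped buffer,
 * the delta stays near zero regardless of file size. */
long read_fault_delta(const char *path)
{
    char buf[4096];
    struct rusage before, after;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    buf[0] = 0;                        /* fault the buffer in up front */
    getrusage(RUSAGE_SELF, &before);
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;                              /* data arrives without CPU traps */
    getrusage(RUSAGE_SELF, &after);

    close(fd);
    return after.ru_minflt - before.ru_minflt;
}
```

This mirrors what perf stat shows in the question: the constant ~150 faults there come from process startup, not from the read() loop itself.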

--- mmap case ---

If you rewrite the test to mmap() your file and then access (text[offset]) a non-present page of the file (one that was not in the pagecache), a real pagefault will occur.

All pagefault counters (perf stat and /proc/$pid/stat) are updated ONLY when a real pagefault trap is generated by the CPU. Here is the x86 page fault handler from arch/x86/mm/fault.c:

1224 dotraplinkage void __kprobes
1225 do_page_fault(struct pt_regs *regs, unsigned long error_code)
1230         __do_page_fault(regs, error_code);
  vvv
1001 /*
1002  * This routine handles page faults.  It determines the address,
1003  * and the problem, and then passes it off to one of the appropriate
1004  * routines.
1005  */
1007 __do_page_fault(struct pt_regs *regs, unsigned long error_code)
 /// HERE is the perf stat pagefault event generator VVV 
1101         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

and somewhat later the pagefault handler will call handle_mm_fault -> handle_pte_fault -> __do_fault, ending in vma->vm_ops->fault(vma, &vmf);.

This fault virtual function was registered at mmap() time, and I think it is filemap_fault. This function does the actual page allocation (__alloc_page) and disk read when the pagecache is empty (counted as a "major" pagefault, because it requires external I/O), or remaps a page from the pagecache (if the data was prefetched or is already in the pagecache; counted as a "minor" pagefault, because it is served without external I/O and is generally faster).
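The minor/major split can also be observed from userspace via getrusage(2), which exposes ru_minflt and ru_majflt. As a sketch (the helper name and the measurement scheme are my own, not from the kernel sources above), touching each page of a fresh mapping produces a measurable minor fault count, unlike the read() case:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/* Touch one byte per page of a freshly mmap()ed file and return how
 * many minor faults that generated (-1 on error). If the data had to
 * come from disk, the faults would count as major (ru_majflt) instead. */
long mmap_minor_fault_delta(const char *path)
{
    struct stat st;
    struct rusage before, after;

    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return -1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    long page = sysconf(_SC_PAGESIZE);
    volatile long sum = 0;
    getrusage(RUSAGE_SELF, &before);
    for (off_t off = 0; off < st.st_size; off += page)
        sum += p[off];                 /* the fault handler runs on each trap */
    getrusage(RUSAGE_SELF, &after);
    (void)sum;

    munmap(p, (size_t)st.st_size);
    close(fd);
    return after.ru_minflt - before.ru_minflt;
}
```

Note that on recent kernels the fault-around optimization can map several neighboring pages per trap, so the delta may be smaller than the page count, but it still grows with the file size.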


PS: Doing the experiments on a virtual platform may change things; for example, even after dropping the disk cache (pagecache) in the guest Fedora with echo 3 > /proc/sys/vm/drop_caches, data from the virtual hard drive can still be cached by the host OS.

  • I know what a page fault is. What I don't understand is how the pages can already be in main memory (already loaded from disk) for file sizes varying from 1 KB to 1 GB. As I updated in my question, the main memory of my virtual machine is just 1 GB, meaning all the pages of a 1 GB file cannot be present in main memory at once. – brokenfoot Apr 27 '14 at 00:26
  • Or in other words, how can the number of page faults for a 1 KB file and a 1 GB file be equal? – brokenfoot Apr 27 '14 at 00:29
  • The read syscall loads data from disk without faulting; no pagefaults. A pagefault is only a load/store access to an unmapped page, but read explicitly maps each page before accessing it, without interrupts. So such page creation is not counted as a pagefault. – osgx Apr 27 '14 at 00:32
  • Try mmap()ing the file into memory and then hitting a byte on every page. I expect you'll see the page fault count scale linearly with the file size in this usage. – Jean-Paul Calderone Apr 27 '14 at 00:35
  • @Jean-PaulCalderone: Yes, that is true, for `mmap` the values for `page-fault` are: 1KB - 148, 10KB - 151, 100KB - 173, 1MB - 402, 10MB - 2,687, 100MB - 25,545, 1GB - 147,004 . – brokenfoot Apr 27 '14 at 00:42
  • @osgx: I didn't know that. Can you please point me to some reference? – brokenfoot Apr 27 '14 at 00:43
  • Actually, I am now digging in the sources between `generic_file_aio_read`, `do_generic_file_read` and `generic_writepages`. The http://www.makelinux.net/books/lkd2/ch15lev1sec1 page says that the `writepage()` virtual function (which is called from `sys_read()`) basically does `grab_cache_page`, which sounds like the read memory page allocator. – osgx Apr 27 '14 at 00:49
  • @osgx: So according to the source you mentioned, it creates a "new" page in case it doesn't already have that page cached. But how is the actual data copied to this page? The actual data might be on disk. Wouldn't that cause a `page-fault`? I am not able to connect the dots. – brokenfoot Apr 27 '14 at 01:01
  • Page allocation is `alloc_page`. Data is read from disk by programming a "DMA write" operation (the disk reads the data and writes it to some physical page in RAM). In the case of `read()`, it both allocates the page and programs the DMA, then waits for the DMA to complete and returns. The data is in place; no fault. In the case of `mmap()`, it only registers the memory mapping (the virt. mem XX..YY is part of file ZZZ at offset OFF). When you access a page from the mapping, the pagefault calls the FS code, which programs the DMA, and after the DMA is done it returns the data to you and re-executes your failed (asm) load operation. – osgx Apr 27 '14 at 01:11
  • So, `read()` programs the `DMA` before actually accessing the page. And by the time it accesses the page, the pages have already been copied to main memory by the DMA, so no `page-fault`. Did I understand correctly? And thank you for your patience! :) – brokenfoot Apr 27 '14 at 01:22
  • Yes, a blocking read loads data from the HDD before returning to the program. This is why a blocking read with an empty pagecache is so slow (you can publish results from a test after drop_caches). With `read` there is no cheating (laziness) like there was with `mmap` (your "touching" anti-cheat code is not needed for `read`, only for `mmap`). [`aio_read`](http://linux.die.net/man/3/aio_read) is more complicated; you should not touch the memory before `aio_return` says that the read has ended. No cheating allowed: "*The buffer area being read into must not be accessed during the operation or undefined results may occur.*" – osgx Apr 27 '14 at 01:26
  • Ok, in the case of a virtual machine you have one more caching layer, the pagecache of the host OS (Windows has a cache for filesystems too). You flushed the pagecaches in the guest OS, but the virtual hard drive may still be (partially) cached in the host. Use bare metal for the test. – osgx Apr 27 '14 at 01:32
  • Thanks a lot @osgx ! btw I ran the same program after `drop_cache`, it gave the same output. I'll test on a native linux machine too. – brokenfoot Apr 27 '14 at 01:32
  • One last thing, can you update the ans with the comments you made here ? – brokenfoot Apr 27 '14 at 01:34
  • I don't want to, because I think the answer is correct for the original question. No mmap was asked about here, and cnicutar in stackoverflow.com/a/23295928/196561 already answered about the mmap/read difference; I just marked in his answer where the pagefaults are. UPDATE: ok, rewrote it a bit with a separation of the `read` and `mmap` cases. Info about minor and major pagefaults has been added. – osgx Apr 27 '14 at 01:51
  • PS: As we know from the http://stackoverflow.com/a/23317928/196561 answer, [`getrusage`](http://man7.org/linux/man-pages/man2/getrusage.2.html) reports the count of data blocks read/written via `read()`/`write()` in the `ru_inblock` and `ru_oublock` fields of `struct rusage`, rather than in `ru_minflt` and `ru_majflt` as in the case of `mmap(file,..)` + pagefault for reading data. – osgx Apr 28 '14 at 14:22