
I wrote a simple matrix multiplication program to demonstrate the effects of caching and performance measurement with perf. The two functions whose relative efficiency I compare are sqmat_mult and sqmat_mult_efficient:

void sqmat_mult(int x, const int a[x][x], const int b[x][x], int m[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[k][j]; // access of b array is non-sequential
            }
            m[i][j] = sum;
        }
    }
}


static void sqmat_transpose(int x, int a[x][x])
{
    for (int i = 0; i < x; i++) {
        for (int j = i+1; j < x; j++) {
            int temp = a[i][j];
            a[i][j] = a[j][i];
            a[j][i] = temp;
        }
    }     
}

void sqmat_mult_efficient(int x, const int a[x][x], int b[x][x], int m[x][x])
{
    sqmat_transpose(x, b);

    for (int i = 0; i < x; i++) {
        for (int j = 0; j < x; j++) {
            int sum = 0;
            for (int k = 0; k < x; k++) {
                sum += a[i][k] * b[j][k];  // access of b array is sequential
            }
            m[i][j] = sum;
        }
    }

    sqmat_transpose(x, b);
}
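
(For completeness, a minimal driver along these lines can be used to exercise the two functions under perf stat. The allocation, initialisation, and binary name below are only a sketch, not my exact harness.)

#include <stdlib.h>

#define N 2048   /* matrix dimension used in the question */

void sqmat_mult(int x, const int a[x][x], const int b[x][x], int m[x][x]);
void sqmat_mult_efficient(int x, const int a[x][x], int b[x][x], int m[x][x]);

int main(void)
{
    /* Heap-allocate the three N x N matrices (48 MiB total for N == 2048). */
    int (*a)[N] = malloc(sizeof(int[N][N]));
    int (*b)[N] = malloc(sizeof(int[N][N]));
    int (*m)[N] = malloc(sizeof(int[N][N]));
    if (!a || !b || !m)
        return 1;

    /* Initialise the inputs so the multiply has real work to do. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    /* Enable one of the two calls, then run e.g. `perf stat ./matmul`
       (binary name is illustrative). */
    sqmat_mult(N, (const int (*)[N])a, (const int (*)[N])b, m);
    /* sqmat_mult_efficient(N, (const int (*)[N])a, b, m); */

    free(a); free(b); free(m);
    return 0;
}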

However, when I run perf stat on these two functions, I am confused by the "page-faults" statistic. The output for sqmat_mult is:

     428374.070363      task-clock (msec)         #    1.000 CPUs utilized          
               128      context-switches          #    0.000 K/sec                  
               127      cpu-migrations            #    0.000 K/sec                  
            12,334      page-faults               #    0.029 K/sec                  
 8,63,12,33,75,858      cycles                    #    2.015 GHz                      (83.33%)
    2,89,73,31,370      stalled-cycles-frontend   #    0.34% frontend cycles idle     (83.33%)
 7,90,36,10,13,864      stalled-cycles-backend    #   91.57% backend cycles idle      (33.34%)
 2,24,41,64,76,049      instructions              #    0.26  insn per cycle         
                                                  #    3.52  stalled cycles per insn  (50.01%)
    8,84,30,79,219      branches                  #   20.643 M/sec                    (66.67%)
       1,04,85,342      branch-misses             #    0.12% of all branches          (83.34%)

     428.396804101 seconds time elapsed

While the output for sqmat_mult_efficient is:

      39876.726670      task-clock (msec)         #    1.000 CPUs utilized          
                11      context-switches          #    0.000 K/sec                  
                11      cpu-migrations            #    0.000 K/sec                  
            12,334      page-faults               #    0.309 K/sec                  
 1,19,87,36,75,397      cycles                    #    3.006 GHz                      (83.33%)
      49,19,07,231      stalled-cycles-frontend   #    0.41% frontend cycles idle     (83.33%)
   50,40,53,90,525      stalled-cycles-backend    #   42.05% backend cycles idle      (33.35%)
 2,24,10,77,52,356      instructions              #    1.87  insn per cycle         
                                                  #    0.22  stalled cycles per insn  (50.01%)
    8,75,27,87,494      branches                  #  219.496 M/sec                    (66.68%)
         50,48,390      branch-misses             #    0.06% of all branches          (83.33%)

      39.879816492 seconds time elapsed

With the transpose-based multiplication, the "backend cycles idle" is considerably reduced, as expected, since the sequential access of b means far less time is spent waiting for data from memory. What confuses me is that the number of "page-faults" is exactly the same. At first I thought this might be a context-switching issue, so I made the program run with SCHED_FIFO scheduling and priority 1. The number of context switches decreased from roughly 800 to 11, but the "page-faults" remained exactly the same as before.
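
(Setting the real-time policy from inside the program can be done roughly as sketched below with the standard sched_setscheduler() call; running the binary under `chrt -f 1` achieves the same from the shell.)

/* Sketch: switch the calling process to SCHED_FIFO with priority 1.
   Requires root or CAP_SYS_NICE. */
#include <sched.h>
#include <stdio.h>

static int set_fifo_priority(void)
{
    struct sched_param sp = { .sched_priority = 1 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) { /* pid 0 == this process */
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}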

The dimension of the matrices I am using is 2048 x 2048, so the entire array will not fit in a 4K page. Nor does it fit in my cache (4 MiB in my case), since the size of the three matrices (3 * 2048 * 2048 * sizeof(int) == 48 MiB) far exceeds the total available cache. Assuming that "page-faults" here refers to a TLB miss or a cache miss, I don't see how the two algorithms can have the exact same number. I understand that a lot of the efficiency improvement I see comes from L1 cache hits, but that does not mean that RAM-to-cache transfers have no role to play.

I also did another experiment to see if the "page-faults" scale with the array size - they do, as expected.

So my questions are:

  1. Am I correct in assuming that "page-faults" refer to a TLB miss or a cache-miss?
  2. How come the "page-faults" are exactly the same for both algorithms?
tinkerbeast
  • [This question](https://stackoverflow.com/questions/37825859/cache-miss-a-tlb-miss-and-page-fault) goes into the difference of page-fault, TLB miss and cache miss. The size of your working set divided by the page size (4k) is pretty much the number of page faults in your case. The algorithm makes no difference – Zulan May 14 '18 at 09:16
  • Possible duplicate of [cache miss, a TLB miss and page fault](https://stackoverflow.com/questions/37825859/cache-miss-a-tlb-miss-and-page-fault) – Zulan May 14 '18 at 09:17
  • @Zulan I am not sure why you think that the algorithm's memory access pattern has nothing to do with page faults - My question is precisely that. – tinkerbeast May 14 '18 at 09:29
  • Ok, I'll try to clarify for your specific example. The answers on the linked question are unfortunately problematic. – Zulan May 14 '18 at 10:30
  • To reduce page-faults, allocate in big enough chunks for the kernel to use anonymous hugepages. If you don't have it set to defrag physical memory by default to find contiguous 2M chunks, use `madvise(MADV_HUGEPAGE)` to encourage it to defrag for that region. Also, you can allocate memory with `mmap(MAP_POPULATE)` to "prefault" it so you won't get any pagefaults when touching it, only TLB misses. @Zulan, you might want to mention this in your answer. – Peter Cordes May 14 '18 at 19:28
  • @PeterCordes This is good advanced information. However, I wouldn't want to give the impression that there is any problem about the exposed page faults. In the vast majority of the cases, one shouldn't do anything about it. If it actually is helpful, e.g. for isolated benchmarks, I would rather generically touch all memory instead of applying non-portable techniques. This also helps with cache & NUMA. – Zulan May 15 '18 at 11:49
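
A rough sketch of the allocation strategy mentioned in Peter Cordes's comment above (the helper name and minimal error handling are illustrative; `mmap(MAP_POPULATE)` and `madvise(MADV_HUGEPAGE)` are the standard Linux calls):

#define _GNU_SOURCE          /* for MADV_HUGEPAGE on some glibc configurations */
#include <stddef.h>
#include <sys/mman.h>

/* Allocate an anonymous region that is prefaulted up front and hinted
   towards transparent huge pages. */
static void *alloc_prefaulted(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, /* fault all pages in now */
                   -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    madvise(p, bytes, MADV_HUGEPAGE); /* hint: back this region with 2 MiB pages */
    return p;
}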

1 Answer


Am I correct in assuming that "page-faults" refer to a TLB miss or a cache-miss?

No. A page fault happens when the virtual memory is not mapped to physical memory. There are two typical¹ reasons for this:

  1. Data was moved from physical memory to disk (swapping). When the page fault happens, the data will be moved back to physical memory and a mapping from virtual memory will be restored.
  2. A memory region was lazily allocated. Only upon the first access is the actual physical memory allocated and mapped to the corresponding virtual memory.

Now a page fault will probably² only happen after a cache miss and a TLB miss, because the TLB only stores actual mappings and a cache will only contain data that is also in main memory.

Here is a simplified chart of what's happening:

[Figure: decision-making process for a memory lookup. Image by Aravind krishna, CC BY-SA 4.0, from Wikimedia Commons]

How come the "page-faults" are exactly the same for both algorithms?

All your data fits into the physical memory, so there is no swapping in your case. This leaves you with 2), which happens once for each virtual page. You use 48 MiB of memory, which is 12288 pages of 4 KiB each. This is the same for both algorithms and almost exactly what you measure.
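
To see this in isolation, here is a small self-contained sketch (separate from your program; the binary name is a placeholder): the number of page faults depends only on how many distinct pages are touched for the first time, not on the access pattern.

/* Every 4 KiB page of a lazily allocated region faults exactly once.
   Under `perf stat -e page-faults ./demo` this should report roughly
   48 MiB / 4 KiB = 12288 faults plus a small constant from program startup,
   regardless of the order in which the pages are touched. */
#include <stdlib.h>

#define BYTES (48u * 1024 * 1024)  /* same working-set size as the three matrices */
#define PAGE  4096u

int main(void)
{
    volatile char *buf = malloc(BYTES);  /* lazily allocated: no faults yet */
    if (!buf)
        return 1;

    /* First pass: one page fault per 4 KiB page. */
    for (size_t off = 0; off < BYTES; off += PAGE)
        buf[off] = 1;

    /* Second pass over the same memory: no further page faults,
       only (at most) TLB and cache misses. */
    for (size_t off = 0; off < BYTES; off += 64)
        buf[off]++;

    return 0;
}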

¹ Additional reasons can come up with memory-mapped IO or NUMA balancing.

² I suppose you could build an architecture where the TLB contains information about unmapped virtual addresses, or where the cache is exclusive of main memory, but I doubt that such an architecture exists.

Zulan
  • I have a highly concurrent application that also observes a huge number of page faults, but all the memory is pre-allocated and pre-faulted, it does not do disk or network I/O, and I have used `numactl` to bind the memory to one NUMA node. Is there any other reason? – 1a1a11a Mar 28 '23 at 14:06