
I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake architecture), with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to one cache line.

for (int i = 0; i < 1000000; i++) {
    array[i].value = 2;
}
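
A minimal sketch of how the measurement might look with PAPI's named-event API (the struct layout, error handling, and setup details here are assumptions; only the loop and the event name come from my actual program):

```c
#include <stdio.h>
#include <papi.h>

/* 64-byte elements, one per cache line (the struct layout is an assumption;
   only the element size is stated above). */
static struct { long value; char pad[56]; } array[1000000];

int main(void) {
    int es = PAPI_NULL;
    long long count = 0;

    /* Error checking and permission handling omitted for brevity;
       uncore events typically need elevated perf_event permissions. */
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    /* One channel shown; repeating for skx_unc_imc1..5 covers the other 5/6. */
    PAPI_add_named_event(es, "skx_unc_imc0::UNC_M_WPQ_INSERTS");

    PAPI_start(es);
    for (int i = 0; i < 1000000; i++)
        array[i].value = 2;
    PAPI_stop(es, &count);

    printf("skx_unc_imc0::UNC_M_WPQ_INSERTS = %lld\n", count);
    return 0;
}
```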

For this loop, when I map the memory to the DRAM NUMA node, the counter gives around 150,000 as a result, which roughly makes sense: there are 6 channels in total across the 2 memory controllers behind this NUMA node, with the DRAM DIMMs in interleaving mode. Each channel has one separate WPQ in front of it, I believe, so skx_unc_imc0 gets about 1/6 of all the stores. papi_native_avail lists counters skx_unc_imc0-5, supposedly one per channel.
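As a sanity check on the arithmetic: 1,000,000 stores spread over 6 channels is about 167,000 per channel, which is in the same ballpark as the ~150,000 that each skx_unc_imc counter reports.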

The unexpected result comes when, instead of mapping to the DRAM NUMA node, I map the program's memory to Non-Volatile Memory, which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, forming one interleaved region. So when writing to NVM, 6 channels should similarly be used, and in front of each there is the same single WPQ, which should again get 1/6 of the write inserts.
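
For reference, this is roughly how such a binding can be done in code with libnuma (a sketch; the node number is machine-specific and the struct layout is assumed):

```c
#include <stdio.h>
#include <numa.h>

/* 64-byte elements, as above (struct layout assumed). */
struct elem { long value; char pad[56]; };

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    size_t n = 1000000;
    int node = 2;  /* placeholder: the NVM NUMA node on this socket
                      (use the DRAM node number to compare) */
    struct elem *array = numa_alloc_onnode(n * sizeof *array, node);
    if (!array) { perror("numa_alloc_onnode"); return 1; }

    for (size_t i = 0; i < n; i++)
        array[i].value = 2;

    numa_free(array, n * sizeof *array);
    return 0;
}
```

(Link with -lnuma. Running the unmodified program under numactl --membind=<node> achieves the same placement without code changes.)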

But on NV memory, UNC_M_WPQ_INSERTS returns only around 1,000. I don't understand why; I expected it to similarly report around 150,000 WPQ inserts.

Am I interpreting/understanding something wrong? Are there two different WPQs per channel, depending on whether the write goes to DRAM or NVM? Or what else could explain this?

  • My only hypotheses are: 1) that counter only counts for DRAM (somehow writes to NVDIMMs are treated differently); 2) the writes are not making it to the MC, either because they stop earlier (how so?) or because there is an internal, separate NVDIMM controller in the MC itself; 3) the NVDIMMs are behind a (possibly integrated) PCIe controller (which is the real target of the writes, in place of the MC). But I don't see how this list is going to help you :D I mean, those writes cannot simply disappear! – Margaret Bloom Mar 24 '20 at 15:04
  • @MargaretBloom I agree, they have to go somewhere. I'm starting to think that the counter maybe only monitors DRAM writes, though it does not make good sense to me why it would do so. There is this paper: https://arxiv.org/pdf/1908.03583.pdf which describes some details of the architecture, in section 2.1.1. What they are saying is that the iMC maintains read and write pending queues (RPQs and WPQs) for each of the 3D XPoint DIMMs, so I would have been less surprised if only NVM writes were counted instead of DRAM. – Ana Khorguani Mar 24 '20 at 15:30
  • Have you tried monitoring all the counters from all the iMCs? Could it be that the writes are routed to a different iMC than expected? – Margaret Bloom Mar 24 '20 at 16:59
  • I have tried all 6 of them, skx_unc_imc0-5, but there is nothing for NVM. For DRAM each gets 1/6. – Ana Khorguani Mar 24 '20 at 17:27
  • @MargaretBloom It turns out that your first hypothesis is correct. I found here: https://download.01.org/perfmon/index/cascadelake_server.html that there is another counter, UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory. But unfortunately papi_native_avail does not show me this event on any of the Cascade Lake machines that I have access to, which is why I did not even check it before :( – Ana Khorguani Mar 24 '20 at 20:17
  • Is there any way this `UNC_M_PMM_WPQ_INSERTS` hardware counter can be available on the machine but not show up in `papi_native_avail`, and be used some other way? – Ana Khorguani Mar 24 '20 at 20:32
  • Looking at the source code, it seems PAPI isn't aware of that event. Maybe you can try emailing them, or we can try to hack that event into the source code :D – Margaret Bloom Mar 25 '20 at 08:16
  • I like the idea :D I will send them an email. Hacking their source code seems fun too. However, I was thinking that maybe there is a way to check that the event is supported by the machine itself? Even though this new event is on the Intel page, it might not be present on all Cascade Lake machines? – Ana Khorguani Mar 25 '20 at 08:45
  • I think it is in all Cascade Lakes. However, I was unable to find any official documentation; the uncore PMU documentation seems to be missing. Linux `perf` [does support these new events though](https://dyninst.github.io/scalable_tools_workshop/petascale2019/assets/slides/CSCADS%202019%20perf_events%20status%20update.pdf), so maybe you can check with it? I think PAPI is missing them because they may not be "in the loop" with Intel. – Margaret Bloom Mar 25 '20 at 10:45
  • I am checking with perf right now. What I have found so far with `perf list uncore` is the following: `unc_m_pmm_bandwidth.write` - [Intel Optane DC persistent memory bandwidth write (MB/sec). Derived from `unc_m_pmm_wpq_inserts`. Unit: uncore_imc], which means that `unc_m_pmm_wpq_inserts` should be in there somewhere :D – Ana Khorguani Mar 25 '20 at 11:09
  • For now, `perf stat -e unc_m_pmm_bandwidth.write` returns a result, but when I try `perf stat -e unc_m_pmm_wpq_inserts` directly, I get a parser error. – Ana Khorguani Mar 25 '20 at 11:20
  • `perf` supports raw events, so, according to [this](https://github.com/torvalds/linux/blob/master/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json#L88), the command `perf stat -e uncore_imc/event=0xe7` should count the WPQ inserts for the PMM. You should be able to get more details about the raw format used by `perf` (particularly the PMU name) with the command `ls /sys/devices/*/format`. I'm not sure how that all works on a NUMA machine :/ – Margaret Bloom Mar 25 '20 at 15:04
  • Wow, great, thanks a lot, something started to work :D I used it like this: `perf stat -e uncore_imc/event=0xe7/`, and it returns some value. To begin with, I am trying to see how to add event modifiers to this raw format to get only user-space counting. With the name it would simply be adding :u, like this: `perf stat -e unc_m_pmm_wpq_inserts:u`, but it seems that with the raw version it's not that simple. – Ana Khorguani Mar 25 '20 at 15:57
  • It should be something like `/u` at the end. But do you think the kernel is writing to the NVRAM? – Margaret Bloom Mar 25 '20 at 16:00
  • Ok, I tried `perf stat -e uncore_imc/event=0xe7/u` and it returns an error, which I think means that the event itself does not support this modifier. – Ana Khorguani Mar 25 '20 at 16:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210320/discussion-between-margaret-bloom-and-ana-khorguani). – Margaret Bloom Mar 25 '20 at 16:12
  • There could be undocumented events for PMM WPQ full cycles and PMM WPQ not-empty cycles. Check whether the IMC events 0xE5, 0xE6, 0xE8, and 0xE9 have counts larger than zero. Try also different variations of your program, such as larger array element sizes (128 bytes, 256 bytes, and 512 bytes), and unroll the loop some number of times. Alternatively, it may be possible to approximate these events using `UNC_M_PMM_WPQ_OCCUPANCY.ALL` / `UNC_M_CLOCKTICKS` and maybe the modifier `cmask=1`. Note that the `u` modifier cannot be used with uncore events. – Hadi Brais Apr 19 '20 at 20:10
  • @HadiBrais Hi, thank you very much for the suggestions. I will apply all of them before the end of the week and I'll comment the results as well. – Ana Khorguani Apr 21 '20 at 14:30
  • @HadiBrais Hi, sorry for such a late reply. I tested now: 0xE5 counts 68,303,263 for the loop above of 1 million writes. The rest of the counters return 0. I tried adding a few more loops of writes, but nothing changed other than the 0xE5 result, which scaled proportionally. Same for element sizes. But what counters are they? I was not able to find them [here](https://github.com/torvalds/linux/blob/master/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json#L88) – Ana Khorguani Apr 24 '20 at 13:09
  • I can count UNC_M_CLOCKTICKS by name. For UNC_M_PMM_WPQ_OCCUPANCY.ALL I used 0xE4, though it returns 0. Adding cmask like this: `uncore_imc/event=0xe4,cmask=0x1/` gives an event syntax error. – Ana Khorguani Apr 24 '20 at 13:42
  • Nice. The events 0xE5, 0xE6, 0xE8, and 0xE9 are undocumented. Can you try with a read loop instead of write loop? The event `UNC_M_PMM_WPQ_OCCUPANCY.ALL` requires `umask=0x1`. You can only use it by name in kernel v5.5-rc1 and later. What is the value of `UNC_M_CLOCKTICKS`? – Hadi Brais Apr 24 '20 at 16:32
  • Ah, ok, I see. Well, the UNC_M_CLOCKTICKS count is 450,584,412 for this write loop. For UNC_M_PMM_WPQ_OCCUPANCY.ALL, `uncore_imc/event=0xe4,umask=0x1/` returns 959,793,105. I added a read loop after the write loop as well, and the 0xE6, 0xE8, and 0xE9 counters are still 0. The 0xE5 result is the same. – Ana Khorguani Apr 24 '20 at 19:21
  • @HadiBrais Evaluating UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_CLOCKTICKS based on these results, the ratio seems to be around 2. I'm not sure if this is expected or not. Also, I think I'll try updating the kernel soon to see if I get something else with a newer version. – Ana Khorguani May 01 '20 at 22:31
  • Well, a ratio of 2 means that the average WPQ occupancy is only 2, which is much smaller than what I expected. Try with a larger number of iterations, such as 10 million or 1 billion. Try also measuring `uncore_imc/event=0xe4,umask=0x1,cmask=0x1/` and see how it compares against cmask=0x0. – Hadi Brais May 02 '20 at 00:16
  • @HadiBrais Unfortunately this: `uncore_imc/event=0xe4,umask=0x1,cmask=0x1/` does not seem to work; I get an event syntax error. For UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_CLOCKTICKS with 10 million writes, the ratio goes up to 3. It's also 3 with 100 million writes; with 1 billion it's 2 again. – Ana Khorguani May 02 '20 at 11:14
  • I tried adding a flush instruction and a memory fence after the write operation inside the loop. For 1 million writes, I get 108,109,453 for `uncore_imc/event=0xe4,umask=0x1/` and 2,164,543,760 for `UNC_M_CLOCKTICKS`. Without the memory fence, just `clwb` gives 514,017,990 `UNC_M_CLOCKTICKS` and 218,640,321 for `uncore_imc/event=0xe4,umask=0x1/`. I expected UNC_M_PMM_WPQ_OCCUPANCY.ALL to increase, since flushing the writes should saturate the WPQ more, but it does not look so. – Ana Khorguani May 02 '20 at 11:18
  • @MargaretBloom I was just reading this answer: https://stackoverflow.com/a/50322404/12419816. Here the author seems to state that an `mfence` is the instruction that is guaranteed to flush the store buffer. Does an `sfence` do this too? – Suraaj K S Nov 12 '20 at 11:36
  • @SuraajKS Intel's manual says it does; the tests performed by other users say it doesn't. I asked your exact question in the comments (the very last ones) of the answer you linked. I've never tested it personally. – Margaret Bloom Nov 12 '20 at 19:12
  • @AnaKhorguani If you're still looking for events that represent "PMM WPQ full cycles" and "PMM WPQ not empty cycles," my earlier suggestions of using the events 0xE6 and 0xE5 are correct according to [https://download.01.org/perfmon/CLX/cascadelakex_uncore_v1.11_experimental.json](https://download.01.org/perfmon/CLX/cascadelakex_uncore_v1.11_experimental.json). These are called UNC_M_PMM_WPQ_CYCLES_FULL and UNC_M_PMM_WPQ_CYCLES_NE, respectively. – Hadi Brais Mar 11 '21 at 19:31
  • @HadiBrais Hello, thank you for the document. I went back to checking these counters. It's a bit intriguing, since I have not managed to make 0xE6 count anything other than 0 for any program :D which implies that the Write Pending Queue does not get filled. 0xE5 is closer to the expected behavior (it drops significantly or reaches 0 when binding memory to DRAM, and reaches billions when memory is bound to NVM). – Ana Khorguani May 26 '21 at 14:25
  • Yes I think it's not getting filled. I currently don't have access to a system with pmem to test some of the microbenchmarks that I think would get it full, but it's an interesting and important microarchitectural analysis problem. – Hadi Brais May 29 '21 at 00:06
  • @HadiBrais I agree. If it is really illustrating that the Write Pending Queue does not get full, it means that the memory controller for the NVM does not become a bottleneck. If those microbenchmarks are accessible and easy to run, I could test them. – Ana Khorguani Jun 02 '21 at 11:46

1 Answer


It turns out that UNC_M_WPQ_INSERTS counts the number of allocations into the Write Pending Queue only for writes to DRAM. Intel has added a corresponding hardware counter for persistent memory, UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory.

However, there is no such native event showing up in papi_native_avail, which means it can't be monitored with PAPI yet. In Linux version 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that, even though UNC_M_PMM_WPQ_INSERTS is not directly listed as an event in perf list, it should exist on the machine.

As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows: perf stat -e uncore_imc/event=0xe7/. However, it does not seem to support event modifiers to restrict counting to user space. After pinning the thread to the same socket as the NVM NUMA node, for a program that essentially only runs the loop described in the question, the perf result makes reasonable sense:

Performance counter stats for 'system wide':

     1,035,380      uncore_imc/event=0xe7/

So far, this seems to be the best guess.
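
For completeness, here is a minimal sketch (not from the original measurement; the sysfs path, CPU number, and use of channel 0 only are illustrative assumptions) of how the same raw event could be read programmatically with perf_event_open(), as a workaround for PAPI not knowing the event:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static struct { long value; char pad[56]; } array[1000000];  /* 64 B per element */

int main(void) {
    /* Each IMC channel is a separate PMU (uncore_imc_0 .. uncore_imc_5);
       its dynamic type id is read from sysfs (path assumed). */
    FILE *f = fopen("/sys/devices/uncore_imc_0/type", "r");
    if (!f) { perror("fopen"); return 1; }
    int pmu_type = 0;
    fscanf(f, "%d", &pmu_type);
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = pmu_type;   /* uncore_imc_0 */
    attr.config = 0xe7;     /* event=0xe7, i.e. UNC_M_PMM_WPQ_INSERTS */
    attr.disabled = 1;

    /* Uncore events are per-socket, not per-task: pid = -1 and a CPU on the
       socket of interest (CPU 0 assumed here). Requires root or a relaxed
       perf_event_paranoid setting. */
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (int i = 0; i < 1000000; i++)   /* the write loop from the question;   */
        array[i].value = 2;             /* bind its pages to the NVM node to   */
                                        /* see PMM WPQ inserts on this channel */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof count);
    printf("uncore_imc_0/event=0xe7/: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```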
