
In this paper, the 8-byte sequential write latency of Optane PM is reported as 90 ns with clwb and 62 ns with ntstore, and the sequential read latency as 169 ns.

But in my test with an Intel Xeon Gold 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. My test method differs from the paper's, of course, but the results are far worse than seems reasonable, and my test is closer to actual usage.

During the test, did the Write Pending Queue (WPQ) of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck and stall, making the measured latency inaccurate? If so, is there a tool to detect it?

#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"

//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem

int main()
{
    size_t mapped_len;
    char str[32];
    int is_pmem;
    sprintf(str, "/mnt/pmem/pmmap_file_1");
    int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (p == NULL)
    {
        printf("map file fail!");
        exit(1);
    }
    if (!is_pmem)
    {
        printf("map file fail!");
        exit(1);
    }

    struct timeval start;
    struct timeval end;
    unsigned long diff;
    int loop_num = 10000;

    _mm_mfence();
    gettimeofday(&start, NULL);

    for (int i = 0; i < loop_num; i++)
    {
        p[i] = 0x2222;
        _mm_clwb(p + i);
        // _mm_stream_si64(p + i, 0x2222);
        _mm_sfence();
    }

    gettimeofday(&end, NULL);

    diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;

    printf("Total time is %ld us\n", diff);
    printf("Latency is %ld ns\n", diff * 1000 / loop_num);

    return 0;
}

Any help or correction is much appreciated!

dangzzz
  • Your "In this paper" link is https://stackoverflow.com/. What paper did you mean? – Peter Cordes Mar 29 '21 at 17:03
  • I'd expect `sfence` after every qword write to seriously hurt memory-level parallelism. Especially since that means you're doing a partial-line NT store, because there are 8 qwords in a cache line and you're doing sfence after only one of them. IIRC, multiple back-to-back writes to the same line is also particularly bad for Optane. – Peter Cordes Mar 29 '21 at 17:06
  • @Peter Cordes I updated the link. `sfence` hurts exactly as you said, but in most real situations you must issue `sfence` after each `ntstore`/`clwb` to ensure persistence consistency. That paper also measures the latency in this way (but with `mfence` instead of `sfence`, a stronger fence). For `clwb`, with or without `sfence` only makes a difference of 100ns. `ntstore`'s latency without `sfence` is only 16ns, because `ntstore` without `sfence` provides almost no persistence. So I still don't know why it differs so much from the data in the paper. Thanks! – dangzzz Mar 30 '21 at 03:25
  • Yeah, that makes sense that encapsulation / software architecture makes it hard to defer an sfence between stores that don't need to be persistently ordered. You still usually only need to do a limited amount of persistent stores between other work, though, right? So there could still be a significant difference between this test and real workloads. And it's certainly not the *best* case anymore. – Peter Cordes Mar 30 '21 at 03:39

2 Answers

  1. The main reason is that repeatedly flushing the same cache line is delayed dramatically [1]; see the sketch after this list.
  2. You are testing the average latency instead of the best-case latency that the FAST20 paper reports.
  3. ntstore is more expensive than clwb, so its latency is higher. I guess that's a typo in your first paragraph.
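
For illustration, here is a minimal sketch (not the paper's harness) of how the question's loop could touch and flush a different 64-byte cache line each iteration instead of hammering the same one; it reuses `p` and `loop_num` from the question's program:

for (int i = 0; i < loop_num; i++)
{
    int64_t *line = p + (size_t)i * 8;  // 8 qwords = one 64-byte cache line
    *line = 0x2222;                     // one 8-byte store per line
    _mm_clwb(line);                     // flush this line (evicts it on Cascade Lake)
    _mm_sfence();                       // order the flush before the next store
}

With loop_num = 10000 this touches 10000 * 64 bytes = 640 KiB, well inside the 512 MiB mapping, and no cache line is flushed twice.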

Appended on 4/14:

Q: Tools to detect a possible bottleneck in the WPQ or the WC buffers?
A: You can get a baseline when the PM is idle, and use that baseline to indicate a possible bottleneck.
Tools:

  1. Intel Memory Bandwidth Monitoring
  2. Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which accumulates the number of occupied WPQ entries each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ as UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS [2]. A possible perf invocation is sketched below.
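
For reference, a hypothetical `perf` invocation for tool 2 might look like the following; the exact event names depend on your kernel/perf version and the platform's event tables (check `perf list | grep -i wpq` first), so treat this as a sketch rather than a verified recipe:

perf stat -a -e unc_m_pmm_wpq_occupancy.all -e unc_m_pmm_wpq_inserts -- ./aep_test

Dividing the occupancy count by the inserts count gives the average number of iMC cycles a write spends queued in the WPQ; compare that against the idle-PM baseline to see whether the WPQ is backing up during the benchmark.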

[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.

grayxu
  • Yes, and this paper (https://dl.acm.org/doi/pdf/10.1145/3492321.3519556) goes on to discuss the reason for the abnormal latency. – dangzzz Apr 14 '22 at 01:32
  • @dangzzz and Yargee: this isn't strictly testing average *latency*, it's testing throughput of flush / barrier operations when hammering away just doing this. It depends on the CPU how much memory-level parallelism is available for this operation, whether multiple store + clwb transactions can be in flight at once, like a grocery store conveyor belt with dividers (sfence barriers) between store+clwb pairs. (The fact that `clwb` runs as `clflush` on CPUs before Ice Lake probably hurts a lot, and may mean it is measuring flush/reload latency.) – Peter Cordes Apr 14 '22 at 05:29
  • @PeterCordes though `clwb` is async, a `fence` can make those clwb operations run without queuing (and the paper used exactly the same method). "`clwb` runs as `clflush` before Ice Lake": I'd appreciate it if you could share a reference for that. The PMDK docs claim that "CLWB is the same as CLFLUSHOPT except that cacheline may remain valid". (*But [this eurosys22 paper](dl.acm.org/doi/pdf/10.1145/3492321.3519556) shows `clwb` will invalidate the cache line on 2nd-gen Scalable platforms.*) – grayxu Apr 14 '22 at 08:35
  • See my answer on this question, which yours basically just summarized. [Intel's CLWB instruction invalidating cache lines](https://stackoverflow.com/q/60266778) is linked in a footnote. Architecturally `clwb` is *allowed* to retain the cache line (and documented as always doing that), but in practice it unfortunately doesn't. IMO that's still better than gimping future code by having it fault on SKX / CSL, instead of running as `clflush`. Otherwise PMDK libraries would have to do runtime detection even years from now. – Peter Cordes Apr 14 '22 at 09:46
  • Oh... I guess I mistakenly equated "`clwb` runs as `clflush`" with "`clwb` does not run as `clflushopt`"... :) – grayxu Apr 14 '22 at 12:51
  • Oh sorry, yes `clflushopt`, not strongly-ordered `clflush`. – Peter Cordes Apr 14 '22 at 13:17
  • Besides, in his scenario, `clwb`+`fence` is exactly testing **the average latency**. Of course the SRAM buffer is the reason why latency is hidden, but in this single-threaded loop the latency will be stable no matter how big the working set is. – grayxu Apr 14 '22 at 13:18

https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (footnote 1). So the CPU-pipeline latency of getting a store "accepted" into something persistent.

This isn't the same thing as making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours; WikiChip quotes an Intel image:

[Intel diagram, via WikiChip, showing the memory controller's WPQ inside the persistence domain]

Footnote 1: Also note that clwb on Cascade Lake works like clflushopt: it just evicts. So a store + clwb + mfence loop test would test the cache-cold case if you don't do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully properly support clwb, but at least CSL got the instruction supported, so future libraries won't have to check CPU features before using it.


You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring throughput of a loop, not latency of one store plus mfence itself in a previously-idle CPU pipeline.
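
If you want something closer to a single-operation latency, one option (a rough sketch with a hypothetical helper, not the paper's harness) is to time one store + clwb + sfence at a time with `rdtscp`, after touching the target line so it starts out cached, and then convert cycles to nanoseconds using your TSC frequency:

#include <stdint.h>
#include <x86intrin.h>

// Rough sketch: cycle count for a single 8-byte store + clwb + sfence.
// Assumes `line` points into the mapped pmem and was touched beforehand,
// so it is resident in the cache when this is called.
static inline uint64_t time_one_store(int64_t *line)
{
    unsigned aux;
    _mm_mfence();                    // drain earlier stores before timing
    uint64_t t0 = __rdtscp(&aux);    // timestamp after preceding work completes
    *line = 0x2222;                  // 8-byte store
    _mm_clwb(line);                  // request write-back of the line
    _mm_sfence();                    // wait until the flush is ordered
    uint64_t t1 = __rdtscp(&aux);    // timestamp after the fenced flush
    return t1 - t0;
}

Even this only measures how long the CPU pipeline waits before the store is accepted into the persistence domain (the WPQ), not the time for the data to reach the Optane media.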

Separate from that, rewriting the same line repeatedly seems to be slower than sequential write, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)

Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.


During the test, did the Write Pending Queue (WPQ) of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck and stall, making the measured latency inaccurate?

Yes, that would be my guess.

If so, is there a tool to detect it?

I don't know, sorry.

Peter Cordes
  • I find that "rewriting the same line repeatedly seems to be slower than sequential write" too. And store + clwb + sfence with a different line each time is 230ns in my loop, which matches the data in the paper, since my loop is the cache-cold case (load before store). Thanks, you have helped me many times! – dangzzz Mar 30 '21 at 06:07
  • @dangzzz I don't think `pmem_map_file` prefaults the allocated memory, and you're not doing any initialization. This is probably the biggest cause of discrepancy between your measurements and the numbers reported in the paper. Another important cause is the different frequencies in the different frequency domains of your processor and the one used in the paper. Where in the paper does it say that the latency of "store + clwb + sfence" is 230ns? Didn't you say in the question that the latency of this case is 700ns? Also the numbers 57ns and 62ns don't match Figure 2. – Hadi Brais Mar 30 '21 at 14:57
  • @HadiBrais I have tested the impact of page faults by running the loop twice and measuring only the second pass. It makes no difference. And with the Linux `perf` tools, I see page faults take only 3% of total cycles. The 57ns and 62ns was a mistake in my writing; it's actually 90ns and 62ns. – dangzzz Mar 31 '21 at 07:05
  • @HadiBrais Since in my loop I need to read the cache line before clwb, I think the 230ns is okay: 230ns is approximately equal to 160ns (read latency in Figure 2) plus 62ns (clwb latency in Figure 2). – dangzzz Mar 31 '21 at 07:06
  • @dangzzz The authors published their code on GitHub and it looks like [this](https://github.com/NVSL/OptaneStudy/blob/ea2f4cf4715b46f7c2e701127159ef199ab3a0e1/src/kernel/common.h#L166) is the code for the `store+clwb` case. The cache line is first loaded in the L1D outside of the timed region, and two `vmovdqa` followed by `clwb` are executed on that line within the timed region. `rdtscp` is used to measure time. They presumably fixed the frequency to minimize variability and make it easier to convert to nanoseconds. – Hadi Brais Mar 31 '21 at 13:35
  • In contrast, you're doing 8 stores to the same line and a `clwb` for each store. It looks like there is a bug in the `TIMING_END` sequence because the upper 32 bits of the TSC value are lost when executing `mov %%eax, %%edx`. I'm not sure what the point of `CLEAR_PIPELINE` is. This bench is being called from [here](https://github.com/NVSL/OptaneStudy/blob/227d9062b2cd791a940d3db7cff9dc8a4b700db9/src/kernel/tasks.c#L141). The `BENCHMARK_END` sequence is also buggy because calling `local_irq_enable()` destroys the previous interrupt state. – Hadi Brais Mar 31 '21 at 13:35
  • In the `BENCHMARK_BEGIN` sequence, `local_irq_disable()` is redundant. In the `drop_cache()` function, `mfence` is redundant. I know the paper says the bench does 64-bit stores, but the code is doing two 32-byte stores. Maybe it's an error in the paper. Regarding page faults, if there is no significant difference, then it either means that `pmem_map_file` has prefaulted the buffer or that the cost of a page fault is very small compared to the time it takes to do store+clwb over an entire page. – Hadi Brais Mar 31 '21 at 13:36
  • Another effect that I have not mentioned earlier is that if the first store to each page triggered a fault, the Linux kernel will allocate a page and zero it, so it becomes hot in the L1D. Then the first store to each line in the page will hit in the L1D, but `clwb` evicts the line, so later stores to the same line miss. Note that `sfence` may not prevent speculative RFOs from occurring. `sfence` and `clwb` may not inhibit the L2 streaming prefetcher. Consecutive stores may overlap to a certain extent. – Hadi Brais Mar 31 '21 at 13:36
  • @HadiBrais: The zeroed-page behaviour you describe is for unwritten extents (or sparse files), right? Which is what you get with PMEM_FILE_CREATE *if* the file doesn't already exist. But the OP's code doesn't unlink `/mnt/pmem/pmmap_file_1` so subsequent runs should be writing the same file, and will have to pull in data from pmem on fault, unless I'm missing something. – Peter Cordes Mar 31 '21 at 14:13
  • Yeah I forgot here that we are talking about a mapped file and not normal memory. I looked at the implementation of `pmem_map_file` on Linux. It basically first calls `open()` with the flags `O_CREAT | O_RDWR` and passes the file descriptor to `mmap` with the flags `MAP_SHARED_VALIDATE | MAP_SYNC`. The resulting address is returned by `pmem_map_file`. I don't think it prefaults. It may also use huge pages. – Hadi Brais Mar 31 '21 at 19:28