
Intel Advisor's roofline analysis view presents bandwidth information for the different data paths of the system, i.e. DRAM and the L3, L2 and L1 caches. The tool claims to measure these bandwidths on the actual hardware, i.e. they are neither theoretical estimates nor values reported by the OS.

Question

Why is the DRAM bandwidth 25 GB/s for a single thread?

[screenshot: Intel Advisor roofline chart with the measured memory bandwidth roofs]

Code (for Intel compiler)

In order to see how much data the machine can move in the shortest possible time using all the computational resources available, one could conceptualize a first attempt:

    // required headers: <cstddef>, <cstdlib>, <iostream>, <limits>, <omp.h>, <unistd.h>
    // test-parameters
    const auto size = std::size_t{50 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        // posix_memalign reports failure via its return value, not by nulling the pointer
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        // clear cache
        double* cache_clearer = nullptr;
        if (posix_memalign(reinterpret_cast<void**>(&cache_clearer), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma optimize("", off)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            cache_clearer[index] = -1.0;
        }
        #pragma optimize("", on)
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
        free(cache_clearer);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        // clear cache
        double* cache_clearer = nullptr;
        if (posix_memalign(reinterpret_cast<void**>(&cache_clearer), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma optimize("", off)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            cache_clearer[index] = -1.0;
        }
        #pragma optimize("", on)
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
        free(cache_clearer);
    }

Notes on code:

  1. This is admittedly a 'naive' approach and Linux-only, but it should still serve as a rough indicator of achievable performance,
  2. compiled with the flags `-O3 -ffast-math -march=native`,
  3. `size` is chosen to be larger than the last-level cache of the system (here 50 MiB),
  4. a fresh allocation on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits),
  5. the minimum duration is recorded to counteract the effects of OS scheduling (threads being taken off their cores for a short while, etc.),
  6. a warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature which can alternatively be disabled by using the `userspace` governor).

Results of code

On my machine, using AVX2 instructions (the widest vector instructions available to it), I am measuring a max. bandwidth of 5.6 GB/s.
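
For reference, this figure is derived from the recorded minimum along the following lines (a sketch of the reporting step, which is not shown in the code above; `min_duration` is in microseconds as produced by the timed run):

    // bandwidth = bytes stored by the timed loop / elapsed time
    const auto bytes_written = size * sizeof(double);        // 50 MiB of stores
    const auto seconds = min_duration * 1E-6;                // microseconds -> seconds
    const auto bandwidth = bytes_written / seconds / 1E+9;   // GB/s
    std::cout << "max. bandwidth: " << bandwidth << " GB/s" << std::endl;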

EDIT

Following @Peter Cordes' comment, I adapted my code to make sure memory page placement has already taken place before the timed region. My measured BW is now 90 GB/s. Any explanation as to why it's so high?
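
For what it's worth, a minimal sketch of how the "dead" stores in the timed loop could be pinned down (an illustration, not part of the measured code) is a GCC-style empty inline-asm statement with a memory clobber, placed between the timed loop and the second `omp_get_wtime()` call:

    // compiler barrier: the compiler must assume 'data' can be observed
    // externally here, so the stores of the timed loop above cannot be
    // elided or hoisted out of the timed region
    asm volatile("" : : "g"(data) : "memory");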

Nitin Malapally
  • 25GiB/s for a single thread is reasonable for a "client" chip (desktop/laptop), not for a many-core Xeon. See [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) Especially not for a Skylake-X or later with the mesh interconnect, which scales well for many cores but has poor latency and bandwidth for each core. If you're only seeing 5.6 GB/s on a 6-core machine, you're probably bottlenecking on something else. Even memset (which really costs read+write bandwidth if not using NT stores) should be faster. – Peter Cordes Mar 10 '22 at 12:13
  • What CPU model? Do you have memory installed on both channels so it can run in dual-channel mode? – Peter Cordes Mar 10 '22 at 12:16
  • Oh, you did a free and fresh allocation between warm-up and memset, so you're paying page-fault costs inside the timed region! No wonder it's much slower than actual memory bandwidth. – Peter Cordes Mar 10 '22 at 12:17
  • The processor involved here is the Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0 GHz) [Coffee Lake]. – Nitin Malapally Mar 10 '22 at 12:21
  • Re. CPU Mode: I should think so. Either way, Intel Advisor is working with the same configuration and measuring a different value. – Nitin Malapally Mar 10 '22 at 12:22
  • I've now updated my code and am realizing a single-thread BW of 90 GB/s. Any idea why this may be the case? I was hoping that my trick with the `cache_clearer` would be effective... – Nitin Malapally Mar 10 '22 at 12:45
  • Did you check the compiler's asm output? With all your stores being "dead", it's possible the compiler hoisted the stores out of the timed region into the first pass over the data. But that's unlikely given there are non-inline function calls between those loops. (I don't see the point of your "cache_clearer" loop other than that, though, vs. just using a larger array for the timed thing.) Even if it optimized to use NT stores, 90 GB/s sounds too high for dual-channel DDR4. And the total L2 cache size isn't high enough to account for it. – Peter Cordes Mar 10 '22 at 12:51
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242809/discussion-between-motivehunter-and-peter-cordes). – Nitin Malapally Mar 10 '22 at 13:02

0 Answers