
I understand that RAM latency can be around 100 ns, but I'd like to speed up a program which basically just needs to read from random locations within a huge 4 GB array. It seems the reads could be pipelined, or alternatively (because my full program also does a little writing to RAM) I'd be happy to get "old bits" from RAM and do my own check that I have not recently changed those bits.

Are there any solutions to get faster throughput? I am willing to program in assembly, or even change my hardware, but my first hope is that I could do this on standard Intel/AMD hardware through Visual Studio C++. Please see and test my simple program below - I want to get the read time down from my current 80ns to 2ns!

(By the way, if I reduce the RAM usage from 4 GB to 16 MB, the time falls to 10 ns per read. Some speed-up is expected, but 8x is surprising. Maybe the compiler is using some L2 cache tricks... anyway, 10 ns is still far from the physical limit.)

C++ Code:

#include "pch.h"      // Visual Studio precompiled header; remove if your project doesn't use one
#include <iostream>
#include <chrono>     // just for execution time measurement
#include <cstring>    // memset
#include <cstdlib>    // rand

#define RANDOM_COUNT  ((unsigned long long) 1 << 24)   // 16M reads per timed pass
#define RAM_SIZE      ((unsigned long long) 1 << 29)   // 512M elements * 8 bytes = 4 GB
//#define FAST_MOD      % RAM_SIZE
#define FAST_MOD      & (RAM_SIZE-1)                   // cheap modulo, valid because RAM_SIZE is a power of two

int main()
{
    std::chrono::steady_clock::time_point t1, t2;
    unsigned long long *ram = new unsigned long long[RAM_SIZE];
    memset(ram, 0, RAM_SIZE * sizeof(unsigned long long));
    for (unsigned long long i = 0; i < 10; i++) {
        // cast the first rand() up front so the products don't overflow int
        unsigned long long      random = ((unsigned long long)rand() * rand() * rand()) FAST_MOD;
        unsigned long long  odd_random = 2 * (unsigned long long)rand() * rand() * rand() + 1;
        unsigned long long         sum = 0;
        t1 = std::chrono::steady_clock::now();
        for (unsigned long long j = 0; j < RANDOM_COUNT; j++) {
            sum += ram[random];                          // every read is a fresh pseudo-random location
            random = (random + odd_random) FAST_MOD;     // odd stride => full-period walk over the array
        }
        t2 = std::chrono::steady_clock::now();
        std::cout << "\nns per read :  " << (std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() / ((float)RANDOM_COUNT)) << "   0==" << sum;
    }
    delete[] ram;
}



Output:

ns per read :  83.8036   0==0
ns per read :  85.3504   0==0
ns per read :  85.3037   0==0
ns per read :  84.6396   0==0
ns per read :  78.5159   0==0
ns per read :  83.3926   0==0
ns per read :  85.8171   0==0
ns per read :  84.8495   0==0
ns per read :  85.7676   0==0
ns per read :  85.4356   0==0
– bobuhito

Comments:

  • Random access is the bane of fast throughput. You need something nice and predictable so the prefetcher has a chance of getting the memory you need into the cache. Any way you can change your workflow to do sequential access? – NathanOliver Dec 18 '19 at 15:36
  • The compiler does not use "some L2 cache tricks". 16 MB is well within the cache size of modern processors. – Botje Dec 18 '19 at 15:40
  • No, I can't make it sequential without fully loading the file and re-saving it...and that takes even more time. Also, my PC has L2 cache of only 2MB (no L3 cache). – bobuhito Dec 18 '19 at 15:42
  • @bobuhito There are other aspects of memory accesses that can become faster for smaller working sets, e.g. the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), which is itself a cache of sorts. You're likely to get tons of TLB misses for random 4 GB accesses (see the large-page sketch after these comments). – Max Langhof Dec 18 '19 at 15:46
  • There is not much you can do then. You need a way to get the data into the cache, and the only way to do that is to create a pattern the CPU can identify and leverage to work ahead of you. When you have random access, you basically stop that process and suffer a cache miss on every access, forcing you to slow down to main-memory speeds. Essentially you've turned your vector into a linked list (which is practically the worst data structure for performance). – NathanOliver Dec 18 '19 at 15:46
  • If you can compute the locations that you will eventually access beforehand (well, ~100 ns worth of iterations before needing them), you might be able to utilize prefetching (see the sketch after these comments). But even then your throughput will be many times lower than sequential reads. And manual prefetching in such an unpredictable environment is playing with fire. – Max Langhof Dec 18 '19 at 15:48
  • Consider using an AVX gather operation. On modern microarchitectures, they [considerably improve the throughput](https://stackoverflow.com/a/54137942/417501) over individual loads (see the gather sketch below). – fuz Feb 03 '20 at 15:43
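
Following up on the TLB comment: the comments don't spell out a fix, but one mitigation I could try (my own assumption, Windows-specific) is backing the array with 2 MB large pages so far fewer TLB entries cover the whole 4 GB. A minimal sketch below; `alloc_ram` is my own helper name (not a standard API), and the large-page path only works if the account has the "Lock pages in memory" (SeLockMemoryPrivilege) right enabled for the process token.

    // Sketch only: try a 2 MB large-page allocation for the 4 GB array, falling back
    // to a normal allocation if large pages are unavailable or the privilege is missing.
    #include <windows.h>
    #include <cstdio>

    unsigned long long* alloc_ram(size_t bytes)
    {
        SIZE_T large = GetLargePageMinimum();                      // typically 2 MB, 0 if unsupported
        if (large != 0) {
            size_t rounded = (bytes + large - 1) & ~(large - 1);   // size must be a multiple of the large-page size
            void* p = VirtualAlloc(nullptr, rounded,
                                   MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                   PAGE_READWRITE);
            if (p) return static_cast<unsigned long long*>(p);
            std::printf("large-page allocation failed (error %lu), falling back\n", GetLastError());
        }
        return static_cast<unsigned long long*>(
            VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    }
    // Committed pages come back zero-filled, so the memset is no longer needed;
    // release with VirtualFree(ram, 0, MEM_RELEASE) instead of delete[].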
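
On Max Langhof's prefetching point: since the index sequence here is a pure arithmetic progression, each iteration can compute the index it will need some iterations later and prefetch it. A minimal sketch of the timed inner loop, reusing the variables and FAST_MOD macro from the program above; PREFETCH_DISTANCE is a made-up starting value that would need tuning, and whether this helps at all would need measuring.

    // Sketch only: issue a prefetch for the element needed PREFETCH_DISTANCE
    // iterations from now, so its cache miss overlaps with the current work.
    #include <xmmintrin.h>   // _mm_prefetch

    const unsigned long long PREFETCH_DISTANCE = 16;     // guess; tune experimentally

    unsigned long long ahead = random;                   // index running PREFETCH_DISTANCE steps ahead
    for (unsigned long long k = 0; k < PREFETCH_DISTANCE; k++)
        ahead = (ahead + odd_random) FAST_MOD;

    for (unsigned long long j = 0; j < RANDOM_COUNT; j++) {
        _mm_prefetch((const char*)&ram[ahead], _MM_HINT_T0);   // start the future load early
        sum += ram[random];
        random = (random + odd_random) FAST_MOD;
        ahead  = (ahead  + odd_random) FAST_MOD;
    }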
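
And on fuz's gather suggestion: with AVX2, four of the random 64-bit loads can be issued by a single gather instruction. A rough sketch of a replacement for the timed inner loop, again reusing the variables and macros above and assuming a CPU with AVX2 and that RANDOM_COUNT is a multiple of 4 (it is, being 2^24); whether it actually beats the scalar loop here would need measuring.

    // Sketch only: gather four elements per iteration instead of one scalar load.
    // The four indices are random, random+odd, random+2*odd, random+3*odd (mod RAM_SIZE).
    #include <immintrin.h>

    __m256i idx  = _mm256_setr_epi64x(random,
                                      (random +     odd_random) FAST_MOD,
                                      (random + 2 * odd_random) FAST_MOD,
                                      (random + 3 * odd_random) FAST_MOD);
    __m256i step = _mm256_set1_epi64x((4 * odd_random) FAST_MOD);   // all four lanes advance by 4 strides
    __m256i mask = _mm256_set1_epi64x(RAM_SIZE - 1);
    __m256i acc  = _mm256_setzero_si256();

    for (unsigned long long j = 0; j < RANDOM_COUNT; j += 4) {
        __m256i v = _mm256_i64gather_epi64((const long long*)ram, idx, 8);  // four loads, one instruction
        acc = _mm256_add_epi64(acc, v);
        idx = _mm256_and_si256(_mm256_add_epi64(idx, step), mask);          // advance and wrap all four indices
    }
    // horizontal sum of the four accumulator lanes
    unsigned long long lanes[4];
    _mm256_storeu_si256((__m256i*)lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];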

0 Answers