I understand that RAM latency can be around 100 ns, but I'd like to speed up a program that basically just reads from random locations within a huge 4 GB array. It seems the reads could be pipelined or, alternatively (because my full program also writes a little to RAM), I'd be happy to get stale data ("old bits") from RAM and check for myself that I have not recently changed those bits.
Is there any way to get higher throughput? I am willing to program in assembly, or even change my hardware, but my first hope is that I can do this on standard Intel/AMD hardware using Visual Studio C++. Please see and test my simple program below. I want to get the time per read down from my current 80 ns to 2 ns!
(By the way, if I reduce the array size from 4 GB to 16 MB, the time falls to 10 ns per read. Some speedup is expected, but 8x is surprising. Maybe the compiler is using some L2 cache tricks... anyway, 10 ns is still far short of physical limits.)
C++ Code:
#include "pch.h"    // Visual Studio precompiled header
#include <iostream>
#include <chrono>   // just for execution time measurement
#include <cstdlib>  // std::rand
#include <cstring>  // std::memset
#define RANDOM_COUNT ((unsigned long long) 1 << 24)
#define RAM_SIZE ((unsigned long long) 1 << 29)
//#define FAST_MOD % RAM_SIZE
#define FAST_MOD & (RAM_SIZE-1)
int main()
{
    std::chrono::steady_clock::time_point t1, t2;
    // 2^29 elements * 8 bytes each = 4 GB; the memset touches every page before timing starts
    unsigned long long *ram = new unsigned long long[RAM_SIZE];
    std::memset(ram, 0, RAM_SIZE * sizeof(unsigned long long));
    for (unsigned long long i = 0; i < 10; i++) {
        // cast before multiplying: the product of three ints can overflow int
        unsigned long long random = ((unsigned long long)std::rand() * std::rand() * std::rand()) FAST_MOD;
        unsigned long long odd_random = (2 * (unsigned long long)std::rand() * std::rand() * std::rand() + 1);
        unsigned long long sum = 0;
        t1 = std::chrono::steady_clock::now();  // same clock type as t1 and t2
        for (unsigned long long j = 0; j < RANDOM_COUNT; j++) {
            sum += ram[random];
            random = (random + odd_random) FAST_MOD;  // odd stride, wrapped to the array size
        }
        t2 = std::chrono::steady_clock::now();
        // printing sum (always 0 here) keeps the compiler from discarding the reads
        std::cout << "\nns per read : " << (std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() / ((float)RANDOM_COUNT)) << " 0==" << sum;
    }
    delete[] ram;
}
Output:
ns per read : 83.8036 0==0
ns per read : 85.3504 0==0
ns per read : 85.3037 0==0
ns per read : 84.6396 0==0
ns per read : 78.5159 0==0
ns per read : 83.3926 0==0
ns per read : 85.8171 0==0
ns per read : 84.8495 0==0
ns per read : 85.7676 0==0
ns per read : 85.4356 0==0