18

I have a laptop with an Intel Core 2 Duo 2.4 GHz CPU and two 4 GB DDR3 modules at 1066 MHz.

I expect that this memory could operate at a speed of 1067 MiB/sec, and since there are two channels, the maximum speed would be 2134 MiB/sec (if the OS memory dispatcher allows it).

I made a tiny Java app to test that:

import java.util.Arrays;
import java.util.Random;

public class MemoryTest { // enclosing class added so the snippet compiles

    private static final int size = 256 * 1024 * 1024; // 256 MiB
    private static final byte[] storage = new byte[size];

    private static final int s = 1024; // block size: 1 KiB
    private static final int duration = 10; // test duration: 10 sec

    public static void main(String[] args) {
        // Write test: copy a 1 KiB buffer to random offsets within storage.
        long start = System.currentTimeMillis();
        Random rnd = new Random();
        byte[] buf1 = new byte[s];
        rnd.nextBytes(buf1);
        long count = 0;
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(buf1, 0, storage, (int) begin, s);
            ++count;
        }
        double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        double speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

        // Read test: copy 1 KiB from random offsets within storage into buf2.
        byte[] buf2 = new byte[s];
        count = 0;
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(storage, (int) begin, buf2, 0, s);
            Arrays.fill(buf2, (byte) 0);
            ++count;
        }
        totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
    }
}

I expected the result to be under 2134 MiB/sec; however, I got the following:

17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)

How is it possible that the speed was almost 3 GiB/sec?

(photo of the DDR3 module)


4 Answers

22

There are multiple things at work here.

First of all: the formula for the memory transfer rate of DDR3 is

memory clock rate
× 4  (for bus clock multiplier)
× 2  (for data rate)
× 64 (number of bits transferred)
/ 8  (number of bits/byte)
=    memory clock rate × 64 (in MB/s)

For DDR3-1066 (which is clocked at 133⅓ MHz), we obtain a theoretical memory bandwidth of 8533⅓ MB/s (≈8138.02 MiB/s) for single-channel, and 17066⅔ MB/s (≈16276.04 MiB/s) for dual-channel.
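As a quick cross-check, here is the same calculation in Java (a minimal sketch; the class name is mine, and dual-channel operation is assumed):

public class Ddr3Bandwidth {
    public static void main(String[] args) {
        double memoryClockMHz = 400.0 / 3.0; // 133⅓ MHz for DDR3-1066
        int channels = 2;                    // assumption: dual-channel setup
        // memory clock × 4 (bus clock multiplier) × 2 (data rate) × 64 bits / 8 bits per byte
        double mbPerSec = memoryClockMHz * 4 * 2 * 64 / 8 * channels;
        double mibPerSec = mbPerSec * 1_000_000 / (1024.0 * 1024.0);
        System.out.printf("%.2f MB/s = %.2f MiB/s%n", mbPerSec, mibPerSec);
        // prints: 17066.67 MB/s = 16276.04 MiB/s
    }
}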

Second: transfer of one big chunk of data is faster than transfer of many small chunks of data.

Third: the test ignores caching effects that can occur.

Fourth: for time measurements, one should use System.nanoTime(), which is more precise.

Here is a rewritten version of the test program¹.

import java.util.Random;

public class Main {

  public static void main(String... args) {
    final int SIZE = 1024 * 1024 * 1024;
    final int RUNS = 8;
    final int THREADS = 8;
    final int TSIZE = SIZE / THREADS;
    assert (TSIZE * THREADS == SIZE) : "TSIZE must divide SIZE!";
    byte[] src = new byte[SIZE];
    byte[] dest = new byte[SIZE];
    Random r = new Random();
    long timeNano = 0;

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < RUNS; ++i) {
      System.out.print("Initializing src... ");
      for (int idx = 0; idx < SIZE; ++idx) {
        src[idx] = ((byte) r.nextInt(256));
      }
      System.out.println("done!");
      System.out.print("Starting test... ");
      for (int idx = 0; idx < THREADS; ++idx) {
        final int from = TSIZE * idx;
        threads[idx]
            = new Thread(() -> {
          System.arraycopy(src, from, dest, from, TSIZE); // each thread copies its own slice
        });
      }
      long start = System.nanoTime();
      for (int idx = 0; idx < THREADS; ++idx) {
        threads[idx].start();
      }
      for (int idx = 0; idx < THREADS; ++idx) {
        try {
          threads[idx].join();
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
      }
      timeNano += System.nanoTime() - start;
      System.out.println("done!");
    }
    double timeSecs = timeNano / 1_000_000_000d;

    System.out.println("Transfered " + (long) SIZE * RUNS
        + " bytes in " + timeSecs + " seconds.");

    System.out.println("-> "
        + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
        + " GiB/s");
  }
}

This way, as much "other computation" as possible is mitigated and (almost) only the memory copy rate via System.arraycopy(...) is measured. This algorithm may still have issues with regard to caching.

For my system (dual-channel DDR3-1600), I get something around 6 GiB/s, whereas the theoretical limit is around 24 GiB/s (1600 MT/s × 8 bytes/transfer × 2 channels = 25.6 GB/s).

As was pointed out by Nick Mertin, the JVM introduces some overhead. Therefore, it is expected that you are not able to reach the theoretical limit.


¹ Sidenote: to run the program, one must give the JVM more heap space. In my case, 4096 MB were sufficient.
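For example, assuming the class above is saved as Main.java:

javac Main.java
java -Xmx4096m Main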

  • I tried this test. Had to decrease SIZE to 256Mb. Got: `Transfered 2147483648 bytes in 1.42 seconds. -> 1.41 GiB/s` – Anthony Jul 03 '15 at 20:43
  • 2
    @Antonio why did you decrease the `SIZE`? To get closer to the theoretical limit, the chunks should be as large as possible. It would be better to give the JVM more heap space (`java -Xmx4096m ...`) than decreasing the chunk size. – Turing85 Jul 03 '15 at 20:53
  • I set the storage size to 1 GB and the duration to 30 sec. The results became nearly identical: 1.42 GiB/sec. The CPU is not a bottleneck (about 60% usage). – Anthony Jul 04 '15 at 07:48
  • 2
    @Antonio Your theoretical limit (including dual-channel) is roughly 16 GiB/s, so the theoretical limit is not exceeded. Keep in mind that my benchmark has to transfer everything twice (from memory to CPU and from CPU back to memory), so my program actually transfers 2.84 GiB/s. Your benchmark, however, may profit heavily from caching (your source array is only 1 KB in size and therefore might be cached entirely). So basically both benchmarks show roughly the same performance. – Turing85 Jul 14 '15 at 18:08
  • @Turing85 It is important to remember that because it is running in a JVM, which is in turn running on the OS, there is extra CPU overhead that is independent of the Java code itself. Also, the OS likely prevents a single process from using a large amount of CPU, so even though it was only using 60%, the CPU could still have been the bottleneck. – Nick Mertin Jul 15 '15 at 23:08
  • @MagikM18 at least for my system, the OS did not prevent the program from high CPU load (I was able to get the load well beyond 90%). Maybe the comment was aimed at Antonio? – Turing85 Jul 15 '15 at 23:39
  • @Turing85 it was aimed at both of you; I forgot to mention Antonio. However, even if that part doesn't apply to you, the inefficiencies of the JVM itself, compared to the kernel itself, apply to all Java code. – Nick Mertin Jul 15 '15 at 23:42
  • memory transfers are performed by the CPU, and the CPU *core* running this code must be at 100%, or you are not reaching the limit at all. Just google for "memory benchmark" or "stream benchmark" for tools measuring peak memory performance; you will also find tools that perform random-access tests. Current CPUs exhibit maximum performance using vector instructions (AVX). I suppose Java `arraycopy` uses them as well, but I cannot be sure. The cited tools (modern versions) do make use of vector extensions (some also let you choose how to transfer memory). – Sigi Jul 16 '15 at 13:31
  • related: single-threaded bandwidth on modern Intel CPUs is limited by max-concurrency a single core can keep in flight (and memory latency), not DRAM controller bandwidth. It's even worse on many-core CPUs with higher uncore latency, even if they have more DRAM channels. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) – Peter Cordes Dec 30 '18 at 05:26
  • Also related: [How can cache be that fast?](//electronics.stackexchange.com/a/329955) re: L1d vs. L2 vs. L3 vs. DRAM cache bandwidths (aggregate across multiple cores) – Peter Cordes Jun 18 '19 at 23:42
  • @NickMertin: On an otherwise-idle system, mainstream OSes like Linux allow a thread to use basically 100% of the time on a CPU if it wants to. They don't artificially make CPU cores spend some time idle. If there are more tasks waiting to run than you have cores (e.g. on a busy server), then yes the OS will avoid starving any of them, otherwise you can expect that a CPU-bound task will get at least 99% of a CPU to itself. Just interrupts taking some time away. – Peter Cordes Sep 23 '20 at 22:36
8

Your testing method is flawed in many respects, as is your interpretation of the RAM rating.

Let's start with the rating: since the introduction of SDRAM, marketing names the modules after their bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case, and in practice it cannot be sustained continuously.

Parameters omitted from that label are the actual access time (a.k.a. latency) and the total cycle time (a.k.a. precharge time). These can be figured out by looking at the "timing" specs (the 2-3-3 stuff). Look up an article that explains that stuff in detail. Also, the CPU does not normally transfer single bytes, but entire cache lines (e.g. 8 transfers of 8 bytes = 64 bytes).
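As a rough worked example with just one of those parameters (assuming a DDR3-1066 module with CL=7, the value that comes up in the comments below), the time until the first word of a burst arrives is:

CAS latency (CL)   = 7 I/O bus cycles
I/O bus clock      = 533⅓ MHz  →  cycle time = 1 / 533⅓ MHz = 1.875 ns
first-word latency ≈ 7 × 1.875 ns ≈ 13.1 ns

This is a fixed per-access cost that many small random transfers pay over and over, while large sequential bursts amortize it.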

Your testing code is also ill-designed: you are doing random accesses with a relatively tiny block that is not aligned to actual data boundaries. Such random access also incurs frequent TLB misses in the MMU (look up what the TLB is and does). So you are measuring a wild mixture of different system aspects.
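To see how much the access pattern alone matters, here is a minimal, hypothetical sketch (class and method names are my own; a sketch, not a definitive benchmark) that contrasts large sequential copies with small random ones. Absolute numbers will vary with JIT warm-up, caching, and TLB behavior:

import java.util.Random;

public class AccessPatternSketch {

    static final int SIZE = 256 * 1024 * 1024; // two 256 MiB arrays: run with roughly -Xmx1g
    static final byte[] src = new byte[SIZE];
    static final byte[] dst = new byte[SIZE];

    // Copies `iterations` blocks of `block` bytes each and returns MiB/s.
    static double measure(boolean sequential, int block, int iterations) {
        Random rnd = new Random(42);
        int pos = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            int from = sequential ? pos : rnd.nextInt(SIZE - block);
            System.arraycopy(src, from, dst, from, block);
            pos = (pos + block) % (SIZE - block);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        return (double) block * iterations / secs / (1024 * 1024);
    }

    public static void main(String[] args) {
        measure(true, 1 << 20, 64); // JIT warm-up run, result discarded
        System.out.printf("sequential 1 MiB blocks: %.0f MiB/s%n",
                measure(true, 1 << 20, 2048));
        System.out.printf("random 1 KiB blocks:     %.0f MiB/s%n",
                measure(false, 1024, 2 * 1024 * 1024));
    }
}

On typical hardware the sequential run comes out several times faster, which illustrates that the original test measures the access pattern at least as much as the raw bandwidth.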

  • 2
    Actually it was my goal to test the speed when the desired memory block is not in cache. Let me find out what these timings are and I'll be back. Thank you for the answer. – Anthony Jul 03 '15 at 19:51
  • 1
    I added a photo of the DDR module. There are no timings as far as I can see. – Anthony Jul 03 '15 at 20:07
  • 1
    After some googling I've found CL=7. So how would you calculate a limit? – Anthony Jul 03 '15 at 20:23
  • 2
    @Antonio Start at the wiki entry mentioned in a comment to your question. CL stands for *Column Access Latency*; that's only one of the parameters. The modules "know" their timings, that is, the BIOS of the mainboard reads the parameters and adjusts its timings to match the RAM. There are tools that can display these values. For the "formula" how to actually calculate something useful from the timings, seriously start at wiki, maybe supplement it with the *basics of DRAM access*. It's a relatively complicated and *wide* topic. I don't know every detail myself. – Durandal Jul 04 '15 at 14:36
1

Wikipedia has a table of transfer rates. This particular laptop has the following specs:

  • Module type: PC3-8500 DDR3 SDRAM
  • Chip type: DDR3-1066
  • Memory clock: 133 MHz
  • Bus speed: 1.066 GT/s
  • Transfer rate (bits/s): 64 Gbit/s
  • Transfer rate (decimal bytes/s): 8 GB/s

This is per single DDR3 module per single channel.
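For reference, the peak figure follows directly from the transfer rate (a worked calculation; the 8 GB/s above is the rounded value):

1066⅔ MT/s × 8 bytes/transfer = 8533⅓ MB/s ≈ 8.5 GB/s (per module, per channel)
× 2 channels                  = 17066⅔ MB/s ≈ 17 GB/s dual-channel peak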

1

This could be a matter of hardware configuration. Based on the information provided, there are two cores and two memory modules, but the number of memory channels is unclear. While I have never seen such testing done at the scale of a laptop, on larger systems the configuration of DIMMs across the memory channels can have a significant impact on memory transfer rates.

For example, on modern servers it is possible to have one-DIMM-per-channel (ODPC) or two-DIMMs-per-channel (TDPC) memory configurations. Each physical CPU can have multiple memory channels divided amongst the physical cores on that CPU, and each server can have multiple physical CPUs (typically 2-4 in modern servers).

How the memory is distributed amongst these channels, cores, and CPUs can have a significant impact on memory performance, depending on what is being measured. For example, systems with an ODPC configuration will show significantly better transfer rates (in transfers per second, or megatransfers per second, MT/s) than systems with a TDPC configuration in cases where the amount of memory (in GB) in the TDPC system is equal to or greater than that in the ODPC configuration.

Based on this, it is conceivable that a laptop set up with two memory channels in an ODPC fashion, one channel per core, could theoretically achieve the performance described.

With all that being said, there are a number of prepackaged memory profiling and analysis tools that can be run non-invasively to gather information about memory performance on your system. Memtest is a powerful, well-understood, and well-documented tool for testing memory. It can be put on a bootable medium of some sort (USB, DVD, floppy, etc.) and used to stress the memory on a system without the risk of damaging or disturbing the OS. It is also included on the install DVDs of some Linux distributions, as well as on rescue DVDs/images. I have used it on many occasions to debug and analyze memory performance, though normally on servers.
