
Hi all, I have two pieces of code. The first one is:

#include <iostream>

using namespace std;

static constexpr long long n = 1000000000;

int main() {
  int sum = 0;
  int* a = new int[n];
  int* b = new int[n];

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
  }

  for (long long i=0; i<n; i++) {
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= b[i];
    sum += b[i];
  }

  cout<<sum<<endl;
}

The second one is:

#include <iostream>

using namespace std;

constexpr long long n = 1000000000;

int main() {
  int* a = new int[n];
  int* b = new int[n];
  int sum = 0;

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
    sum *= b[i];
    sum += b[i];
  }

  cout<<sum<<endl;
}

I think the first program should be much faster than the second one, since it's more cache friendly. However, the truth is that the second one is a little faster: on my server, the first one takes 23 s while the second one takes 20 s. Can someone explain this?

Liu Renjie
  • Still, it looks like running a 1000000000-iteration loop two times instead of four is faster. I wonder why. Hit me with a shovel if I'm wrong, but I think this is self-explanatory. – Steeve Nov 04 '16 at 08:52
  • Due to the massive amount of integer overflow you're producing, your program has completely undefined behavior anyway. – Sebastian Redl Nov 04 '16 at 08:53
  • Not enough information. What compiler flags are you using? What's with all the static casting? Nevertheless, this may be a clone of the currently highest-voted C++ question: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array – Jon Chesterfield Nov 04 '16 at 08:54
  • Of course it would help to look at the generated assembly, because it could have wild optimizations. – Steeve Nov 04 '16 at 08:54
  • Checking the assembly output can help a lot; making "assumptions" about what should be faster is mostly guesswork, which is why you should always measure before trying to optimize. The assembly can be found here: https://godbolt.org/g/ED1RRp – Hayt Nov 04 '16 at 08:57
  • Is the ~3 s difference significant in relation to the variance of the measurements? – eerorika Nov 04 '16 at 09:01
  • Apart from wild optimisations the compiler may apply, "cache friendly" is not as simple as it looks at first sight. E.g. it *may* have something to do with the fact that the memory for the two arrays ends up stored on two different devices with their own controllers, so retrieving the data into the CPU cache can go in parallel (somewhat like RAID-0: striping only, no redundancy). – Adrian Colomitchi Nov 04 '16 at 09:01
  • @JonChesterfield They are different questions. My question is about cache-friendliness while your post is about branch prediction. – Liu Renjie Nov 04 '16 at 09:36
  • In what way is the first program more cache friendly? – Leeor Nov 05 '16 at 22:32

3 Answers


You're not seeing cache-friendliness advantages because the access pattern is still much too simple even in the version you predict to be slower.

Two (or more) concurrent streams of straight-line input are something a modern CPU can detect and stream into L1 ahead of the data being needed.

It can also allow multiple SDRAM banks to be put to useful work at the same time. If you're using Linux you don't get much control over that, because pages are mapped randomly (I think; is this still true?), but you can try allocating memory using mmap() with the MAP_HUGETLB flag and then trying different offsets from the start of the allocation.
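
For reference, a minimal sketch of what such an allocation might look like on Linux. This assumes 2 MiB huge pages with some already reserved by the administrator (e.g. via /proc/sys/vm/nr_hugepages); it is illustrative, not a drop-in replacement for the benchmark:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
  // Round the request up to a multiple of the 2 MiB huge-page size.
  const size_t huge_page = 2 * 1024 * 1024;
  const size_t want = 1000000000ULL * sizeof(int);
  const size_t len = (want + huge_page - 1) / huge_page * huge_page;

  // Anonymous, private mapping backed by huge pages.
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap");  // typically fails if no huge pages are reserved
    return EXIT_FAILURE;
  }

  int* a = static_cast<int*>(p);  // use a[0..n) as in the benchmark
  // ... run the benchmark here, experimenting with different offsets
  //     from the start of the mapping ...

  munmap(p, len);
}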

If you want to see the advantage of arranging your computations in a cache-friendly order, experiment with different access patterns in two-dimensional arrays.
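
For instance, a minimal sketch of the classic experiment (the matrix size is an arbitrary choice, about 400 MB of ints; on most machines the column-major pass is several times slower):

#include <chrono>
#include <iostream>
#include <vector>

int main() {
  const long long rows = 10000, cols = 10000;
  std::vector<int> m(rows * cols, 1);
  long long sum = 0;

  auto t0 = std::chrono::steady_clock::now();
  // Row-major traversal: consecutive accesses touch adjacent memory,
  // so every byte of each fetched cache line gets used.
  for (long long r = 0; r < rows; r++)
    for (long long c = 0; c < cols; c++)
      sum += m[r * cols + c];

  auto t1 = std::chrono::steady_clock::now();
  // Column-major traversal: consecutive accesses are cols*sizeof(int)
  // bytes apart, so almost every access lands on a different cache line.
  for (long long c = 0; c < cols; c++)
    for (long long r = 0; r < rows; r++)
      sum += m[r * cols + c];

  auto t2 = std::chrono::steady_clock::now();
  std::cout << sum << "\n"
            << "row-major:    " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "column-major: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}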

sh1

Caches don't play a big role in your example. Linear access to an array much bigger than the caches, with nearly no computation between the accesses, will always be limited by memory bandwidth, not by the caches. They simply don't have enough time to fill up by prefetching.

What you are really testing is how clever your compiler is at optimizing your four/two loops into one, or at figuring out what you are doing and simply printing the result.
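
To make that concrete, here is a hand-written sketch (not what any particular compiler is guaranteed to emit) of what the second program could be reduced to once the loops are fused and the optimizer notices that a[i] == b[i] == i:

#include <iostream>

constexpr long long n = 1000000000;

int main() {
  int sum = 0;
  for (long long i = 0; i < n; i++) {
    const int v = static_cast<int>(i);
    sum = sum * v + v;  // was: sum *= a[i]; sum += a[i];
    sum = sum * v + v;  // was: sum *= b[i]; sum += b[i];
  }
  // No runtime input is involved, so in principle even the final value
  // could be folded to a constant at compile time. (The signed overflow
  // is undefined behaviour, exactly as in the original programs.)
  std::cout << sum << std::endl;
}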

user6556709

For the first piece of code, you use four loops to complete the task:

for (long long i=0; i<n; i++) {
  a[i] = static_cast<int>(i);
}

for (long long i=0; i<n; i++) {
  sum *= a[i];
  sum += a[i];
}

for (long long i=0; i<n; i++) {
  b[i] = static_cast<int>(i);
}

for (long long i=0; i<n; i++) {
  sum *= b[i];
  sum += b[i];
}

In the second one, you use only two loops to complete the task:

for (long long i=0; i<n; i++) {
  a[i] = static_cast<int>(i);
  b[i] = static_cast<int>(i);
}

for (long long i=0; i<n; i++) {
  sum *= a[i];
  sum += a[i];
  sum *= b[i];
  sum += b[i];
}

The total number of loop iterations in the second piece of code is therefore half that of the first.

Issac Saji
  • So, apart from the given result of the test, what makes you think that the number of iterations always matters more than cache friendliness? That was the basic idea of the question. – stefaanv Nov 04 '16 at 09:11
  • Well, the OP thought the first one would be faster because the operations would be more cache-local, resulting in fewer misses, which could make the loops faster and make the iteration count matter less. – Hayt Nov 04 '16 at 09:12
  • Even though there are more loops, the number of operations will be about the same, the only difference being the increments of `i` and the number of comparisons with `n`. I don't think that's enough to justify the difference in performance, also because, with `n` known at compile time, the compiler can perform all sorts of optimizations. – alessandrolenzi Nov 04 '16 at 11:11