7

I'm looking around OpenMP, partially because my program need to make additions of very large vectors (millions of elements). However i see a quite large difference if i use std::vector or raw array. Which i cannot explain. I insist that the difference is only on the loop, not the initialisation of course.

The difference in time I refer to, is only timing the addition, especially not to take into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time. I use g++ 5.x on Ubuntu 16.04.

edit: I tested what @Shadow said, it got me thinking, is there something going on with optimization? If i compile with -O2, then, using raw arrays initialized, I get back for loop scaling with number of threads. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimize before the pragma is seen.

I came up with the following, simple test:

#define SIZE 10000000
#define TRIES 200
int main(){

    std::vector<double> a,b,c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
    for( t = 0; t< TRIES; t++){
       #pragma omp for
       for( i = 0; i< SIZE; i++){
        c[i] = a[i] + b[i];
       }
    }
    }

    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    return 0;

}

I compile with

   g++ -O3 -fopenmp  -std=c++11 main.cpp

And get for one threads

>time ./a.out 
 finished in 2.5638
 ./a.out  2.58s user 0.04s system 99% cpu 2.619 total.

For two threads, loop takes 1.2s, for 1.23 total.

Now if I use raw arrays:

 int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];
    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
       for( t = 0; t< TRIES; t++)
       {
          #pragma omp for
          for( i = 0; i< SIZE; i++)
          {
             c[i] = a[i] + b[i];
          }
       }
    }

    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;

    return 0;
}

And i get (1 thread):

>time ./a.out
 finished in 1.92901 
  ./a.out  1.92s user 0.01s system 99% cpu 1.939 total   

std::vector is 33% slower!

For two threads:

>time ./a.out 
finished in 1.20061                                                              
./a.out  2.39s user 0.02s system 198% cpu 1.208 total   

As a comparison, with Eigen or Armadillo for exactly the same operation (using c = a+b overload with vector object), I get for total real time ~2.8s. They are not multi-threaded for vector additions.

Now, i thought the std::vector has almost no overhead? What is happening here? I'd like to use nice standard library objects.

I cannot find any reference anywhere on a simple example like this.

Zulan
  • 21,896
  • 6
  • 49
  • 109
Napseis
  • 833
  • 2
  • 10
  • 24
  • If you know the exact size at compile time just use `std::array` it might be a little faster than calling resize on a vector for a capacity that large. – Omar Martinez Jun 01 '17 at 16:27
  • What compiler version do you use? What Processor/Memory/OS? It might be necessary to initialize the data with values that are not predetermined zero to get comparable results. – Zulan Jun 01 '17 at 16:58
  • Hello. The timer I refer to, as for 33%, is only timing the addition, especially not to take into account any initialization. I'm really talking only about the sum. The size is not known at compile time. I use g++ 5.something on Ubuntu 16.04. As for processor, i don't see how this is relevant. This was on a i7 6500U i think. 4M L3 cache, for sure. – Napseis Jun 01 '17 at 18:09
  • 1
    Btw: I don't think this is an OpenMP question as you can omit the pragmas and the behavior will still be the same. And you did not even compared the a two thread run with `std::vector` with your two thread run with raw arrays. – BlameTheBits Jun 01 '17 at 18:51
  • i did, just not posted it. 2.6s for 1 threads, 1.2 for 2.. – Napseis Jun 01 '17 at 18:52

6 Answers6

5

Meaningful benchmarking is hard

The answer from Xirema has already outlined in detail the difference in the code. std::vector::reserve initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initalization.

However your measurement doesn't include initialization, and the number of repetitions is so high that the loop costs should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time because the data is initialized?

Minimal example

Let's dig to the core of this with a code that dynamically determines whether memory is initialized or not (Based on Xirema's, but only timing the loop itself).

#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>

constexpr size_t size = 10'000'000;

auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];
    
    if (initialize) {
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    auto start = std::chrono::steady_clock::now();
    
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;
       
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}

Results are consistent:

./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms

First glance at performance counters

Using the excellent Linux perf tool:

$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50  
pid: 12481
         626ms

 Performance counter stats for './a.out 50':

       101.589.231      LLC-loads                                                   
           105.415      dTLB-misses                                                 

       0,629369979 seconds time elapsed

$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
        1008ms

 Performance counter stats for './a.out 50 0.':

       145.218.903      LLC-loads                                                   
         1.889.286      dTLB-misses                                                 

       1,096923077 seconds time elapsed

Linear scaling with increasing number of repetitions also tells us, that the difference comes from within the loop. But why would initializing the memory cause more last level cache-loads and data TLB misses?

Memory is complex

To understand that, we need to understand how memory is allocated. Just because a malloc / new returns some pointer to virtual memory, doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory - and the physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm - and the pid we show as output comes in handy. Looking at the page statistics during a long execution of our little benchmark:

With initialization

                 flags  page-count       MB  symbolic-flags         long-symbolic-flags
    0x0000000000000804           1        0  __R________M______________________________ referenced,mmap
    0x000000000004082c         392        1  __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
    0x000000000000086c         335        1  __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
    0x0000000000401800       56721      221  ___________Ma_________t___________________ mmap,anonymous,thp
    0x0000000000005868        1807        7  ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x0000000000405868         111        0  ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
    0x000000000000586c           1        0  __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                 total       59368      231

Most of the virtual memory is in a normal mmap,anonymous region - something that is mapped to a physical address.

Without initialization

             flags  page-count       MB  symbolic-flags         long-symbolic-flags
0x0000000001000000        1174        4  ________________________z_________________ zero_page
0x0000000001400000       37888      148  ______________________t_z_________________ thp,zero_page
0x0000000000000800           1        0  ___________M______________________________ mmap
0x000000000004082c         388        1  __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c         347        1  __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800       18907       73  ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868         633        2  ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868          37        0  ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c           1        0  __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total       59376      231

Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 are mapped to a zero page. The data behind a and b is all backed by a single read-only 4kiB page filled with zeros. c (and a, b in the other test) have already been written to, so it has to have it's own memory.

0 != 0

Now it may look weird: everything here is zero1 - why does it matter how it became zero? Whether you memset(0), a[i] = 0., or std::vector::reserve - everything causes explicit writes to memory, hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset / reserve is to use calloc to explicitly request zero'd memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero-memory.

1: Remember that the double 0.0 has all bits set to zero.

In the end the performance difference really comes only from the loop, but is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from optimization of an abnormal case of uninitialized data.

Community
  • 1
  • 1
Zulan
  • 21,896
  • 6
  • 49
  • 109
2

I have a good hypothesis.

I've written three versions of the code: one using raw double *, one using std::unique_ptr<double[]> objects, and one using std::vector<double>, and compared the runtimes of each of these versions of the code. For my purposes, I've used a single-threaded version of the code to try to simplify the case.

Total Code::

#include<vector>
#include<chrono>
#include<iostream>
#include<memory>
#include<iomanip>

constexpr size_t size = 10'000'000;
constexpr size_t reps = 50;

auto time_vector() {
    auto start = std::chrono::steady_clock::now();
    {
        std::vector<double> a(size);
        std::vector<double> b(size);
        std::vector<double> c(size);

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_pointer() {
    auto start = std::chrono::steady_clock::now();
    {
        double * a = new double[size];
        double * b = new double[size];
        double * c = new double[size];

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }

        delete[] a;
        delete[] b;
        delete[] c;
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_unique_ptr() {
    auto start = std::chrono::steady_clock::now();
    {
        std::unique_ptr<double[]> a = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> b = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> c = std::make_unique<double[]>(size);

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

int main() {
    std::cout << "Vector took         " << std::setw(12) << time_vector().count() << "ns" << std::endl;
    std::cout << "Pointer took        " << std::setw(12) << time_pointer().count() << "ns" << std::endl;
    std::cout << "Unique Pointer took " << std::setw(12) << time_unique_ptr().count() << "ns" << std::endl;
    return 0;
}

Test Results:

Vector took           1442575273ns //Note: the first one executed, regardless of 
    //which function it is, is always slower than expected. I'll talk about that later.
Pointer took           542265103ns
Unique Pointer took   1280087558ns

So all of STL objects are demonstrably slower than the raw version. Why might this be?

Let's go to the Assembly! (compiled using Godbolt.com, using the snapshot version of GCC 8.x)

There's a few things we can observe to start with. For starters, the std::unique_ptr and std::vector code are generating virtually identical assembly code. std::unique_ptr<double[]> swaps out new and delete for new[] and delete[]. Since their runtimes are within margin of error, we'll focus on the std::unique_ptr<double[]> version and compare that to double *.

Starting with .L5 and .L22, the code seems to be identical. The only major differences are an extra pointer arithmetic before the delete[] calls are made in the double * version, and some extra stack cleanup code at the end in .L34 (std::unique_ptr<double[]> version), which doesn't exist for the double * version. Neither of these seem likely to have strong impact on the code speed, so we're going to ignore them for now.

The code that's identical appears to be the code directly responsible for the loop. You'll notice that the code which is different (which I'll get to momentarily) doesn't contain any jump statements, which are integral to loops.

So all of the major differences appear to be specific to the initial allocation of the objects in question. This is between time_unique_ptr(): and .L32 for the std::unique_ptr<double[]> version, and between time_pointer(): and .L22 for the double * version.

So what's the difference? Well, they're almost doing the same thing. Except for a few lines of code that show up in the std::unique_ptr<double[]> version that don't show up in the double * version:

std::unique_ptr<double[]>:

mov     edi, 80000000
mov     r12, rax
call    operator new[](unsigned long)
mov     edx, 80000000
mov     rdi, rax
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rbx, rax
call    memset //!!!
mov     edi, 80000000
call    operator new[](unsigned long)
mov     rdi, rax
mov     edx, 80000000
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rbp, rax
call    memset //!!!
mov     edi, 80000000
call    operator new[](unsigned long)
mov     r14, rbx
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rdi, rax
shr     r14, 3
mov     edx, 80000000
mov     r13d, 10000000
and     r14d, 1
call    memset //!!!

double *:

mov     edi, 80000000
mov     rbp, rax
call    operator new[](unsigned long)
mov     rbx, rax
mov     edi, 80000000
mov     r14, rbx
shr     r14, 3
call    operator new[](unsigned long)
and     r14d, 1
mov     edi, 80000000
mov     r12, rax
sub     r13, r14
call    operator new[](unsigned long)

Well would you look at that! Some unexpected calls to memset that aren't part of the double * code! It's quite clear that std::vector<T> and std::unique_ptr<T[]> are contracted to "initialize" the memory they allocate, whereas double * has no such contract.

So this is basically a very, very round-about way of verifying what Shadow observed: When you make no attempt to "zero-fill" the arrays, the compiler will

  • Do nothing for double * (saving precious CPU cycles), and
  • Do the initialization without prompting for std::vector<double> and std::unique_ptr<double[]> (costing time initializing everything).

But when you do add zero-fill, the compiler recognizes that it's about to "repeat itself", optimizes out the second zero-fill for std::vector<double> and std::unique_ptr<double[]> (which results in the code not changing) and adds it to the double * version, making it the same as the other two versions. You can confirm this by comparing the new version of the assembly where I've made the following change to the double * version:

double * a = new double[size];
for(size_t i = 0; i < size; i++) a[i] = 0;
double * b = new double[size];
for(size_t i = 0; i < size; i++) b[i] = 0;
double * c = new double[size];
for(size_t i = 0; i < size; i++) c[i] = 0;

And sure enough, the assembly now has those loops optimized into memset calls, the same as the std::unique_ptr<double[]> version! And the runtime is now comparable.

(Note: the runtime of the pointer is now slower than the other two! I observed that the first function called, regardless of which one, is always about 200ms-400ms slower. I'm blaming branch prediction. Either way, the speed should be identical in all three code paths now).

So that's the lesson: std::vector and std::unique_ptr are making your code slightly safer by preventing that Undefined Behavior you were invoking in your code that used raw pointers. The consequence is that it's also making your code slower.

Xirema
  • 19,889
  • 4
  • 32
  • 68
  • 1
    hi,Thanks for this detailed answer. It seems to me it explains the difference in total execution time, but not the loop part ? Did i get correct what you mean ? It's however not so clear to me as why the initialization changes the loop part. With what version and what optimization did you compile ? – Napseis Jun 01 '17 at 20:30
  • @Napseis The loop didn't change at all. The assembly is functionally identical in the loop in all 6 versions of the code (3 without explicit zero-fill, 3 with). The difference is that `memset` is not a "free" operation, and `std::vector` and `std::make_unique` called it (eating up a lot of time) whereas `new double[]` did not. – Xirema Jun 01 '17 at 20:32
  • the problem here is with open mp. It seems to me that the optimization kicks in before the openmp pragma. – Napseis Jun 01 '17 at 20:33
  • I am confused. Does this mean that `memset` is overlapping with the call of the timer? (or vice versa) – BlameTheBits Jun 01 '17 at 20:37
  • Ah, I did not recognize that you are timing the instantiation of the arrays/vectors too. Well, I did not. I only measured the loop but still got the same results... – BlameTheBits Jun 01 '17 at 22:46
  • Good answer revealing the difference! I made another answer explaining why initialization has such a performance impact. – Zulan Jun 02 '17 at 11:47
  • 1
    @Shadow it doesn't matter what part of code was measured if the code has UB – Ap31 Jun 02 '17 at 20:08
  • @Ap31 Yeah, it is UB. But it does not matter because it takes not much time (as Zulan already pointed out) and not because it is UB. Why shouldn't you measure time of code with UB? We actually got a nice answer on how memory allocation works due to measuring of code with UB. (Btw My comment about what was measured was just written in a hurry and I did not thought of the allocation taking not much time comparing to the loop. Sorry if my comments about that lead to confusion.) – BlameTheBits Jun 02 '17 at 20:33
  • @Shadow I think the point is more that when UB is invoked, it can cause code to get shuffled around in ways you definitely don't expect. So it's not about "don't bother benchmarking", it's "be careful what conclusions you draw when benchmarking". – Xirema Jun 02 '17 at 20:39
  • @Shadow I don't see how measuring UB code can produce any valuable information at all. The result could be literally anything by definition of UB, and thus it has no connection to whatever it is you want to measure. Measuring UB (and having UB in general) is very much not OK – Ap31 Jun 02 '17 at 20:43
  • @Ap31 Of course many strange things can happen but in general I don't expect too crazy results, especially if the loops in assembler are identical. UB in the cases I know just means "this or this or that" can happen or a mix of all possibilities. And this is also here the case with the virtual and physical memory. Of course I do not make conclusions about code with UB out of time measurements. – BlameTheBits Jun 02 '17 at 20:56
  • @Shadow I have to say HristoIliev's answer is very good and explains the mechanics perfectly, but this is exactly how UB is connected to all of this - it enables any OS-specific behavior for reading unitialized memory – Ap31 Jun 03 '17 at 05:21
2

The observed behaviour is not OpenMP-specific and has to do with the way modern operating systems manage memory. Memory is virtual, meaning that each process has its own virtual address (VA) space and a special translation mechanism is used to map pages of that VA space to frames of physical memory. Consequently, memory allocation is performed in two stages:

  • reservation of a region within the VA space - this is what operator new[] does when the allocation is big enough (smaller allocations are handled differently for reasons of efficiency)
  • actually backing the region with physical memory upon access to some part of the region

The process is split in two parts since in many cases applications do not really use at once all the memory they reserve and backing the entire reservation with physical memory might lead to waste (and unlike virtual memory, physical one is a very limited resource). Therefore, backing reservations with physical memory is performed on-demand the very first time the process writes to a region of the allocated memory space. The process is known as faulting the memory region since on most architectures it involves a soft page-fault that triggers the mapping within the OS kernel. Every time your code writes for the first time to a region of memory that is still not backed by physical memory, a soft page-fault is triggered and the OS tries to map a physical page. The process is slow as it involves finding a free page and modification on the process page table. The typical granularity of that process is 4 KiB unless some kind of large pages mechanism is in place, e.g., the Transparent Huge Pages mechanism on Linux.

What happens if you read for the first time from a page that has never been written to? Again, a soft page fault occurs, but instead of mapping a frame of physical memory, the Linux kernel maps a special "zero page". The page is mapped in CoW (copy-on-write) mode, which means that when you try to write it, the mapping to the zero page will be replaced by a mapping to a fresh frame of physical memory.

Now, take a look at the size of the arrays. Each of a, b, and c occupies 80 MB, which exceeds the cache size of most modern CPUs. One execution of the parallel loop thus has to bring 160 MB of data from the main memory and write back 80 MB. Because of how system cache works, writing to c actually reads it once, unless non-temporal (cache-bypassing) stores are used, therefore 240 MB of data is read and 80 MB of data gets written. Multiplied by 200 outer iterations, this gives 48 GB of data read and 16 GB of data written in total.

The above is not the case when a and b are not initialised, i.e. the case when a and b are simply allocated using operator new[]. Since reads in those case result in access to the zero page, and there is physically only one zero page that easily fits in the CPU cache, no real data has to be brought in from the main memory. Therefore, only 16 GB of data has to be read in and then written back. If non-temporal stores are used, no memory is read at all.

This could be easily proven using LIKWID (or any other tool able to read the CPU hardware counters):

std::vector<double> version:

$ likwid-perfctr -C 0 -g HA a.out
...
+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     4.4796 |
|        Runtime unhalted [s]       |     5.5242 |
|            Clock [MHz]            |  2850.7207 |
|                CPI                |     1.7292 |
|  Memory read bandwidth [MBytes/s] | 10753.4669 |
|  Memory read data volume [GBytes] |    48.1715 | <---
| Memory write bandwidth [MBytes/s] |  3633.8159 |
| Memory write data volume [GBytes] |    16.2781 |
|    Memory bandwidth [MBytes/s]    | 14387.2828 |
|    Memory data volume [GBytes]    |    64.4496 | <---
+-----------------------------------+------------+

Version with uninitialised arrays:

+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     2.8081 |
|        Runtime unhalted [s]       |     3.4226 |
|            Clock [MHz]            |  2797.2306 |
|                CPI                |     1.0753 |
|  Memory read bandwidth [MBytes/s] |  5696.4294 |
|  Memory read data volume [GBytes] |    15.9961 | <---
| Memory write bandwidth [MBytes/s] |  5703.4571 |
| Memory write data volume [GBytes] |    16.0158 |
|    Memory bandwidth [MBytes/s]    | 11399.8865 |
|    Memory data volume [GBytes]    |    32.0119 | <---
+-----------------------------------+------------+

Version with uninitialised array and non-temporal stores (using Intel's #pragma vector nontemporal):

+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     1.5889 |
|        Runtime unhalted [s]       |     1.7397 |
|            Clock [MHz]            |  2530.1640 |
|                CPI                |     0.5465 |
|  Memory read bandwidth [MBytes/s] |   123.4196 |
|  Memory read data volume [GBytes] |     0.1961 | <---
| Memory write bandwidth [MBytes/s] | 10331.2416 |
| Memory write data volume [GBytes] |    16.4152 |
|    Memory bandwidth [MBytes/s]    | 10454.6612 |
|    Memory data volume [GBytes]    |    16.6113 | <---
+-----------------------------------+------------+

The disassembly of the two versions provided in your question when using GCC 5.3 shows that the two loops are translated to exactly the same sequence of assembly instructions sans the different code address. The sole reason for the difference in the execution time is the memory access as explained above. Resizing the vectors initialises them with zeros, which results in a and b being backed up by their own physical memory pages. Not initialising a and b when operator new[] is used results in their backing by the zero page.

Edit: It took me so long to write this that in the mean time Zulan has written a way more technical explanation.

Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186
  • I am kinda pushing for "you should not measure and compare UB" here, but your answer and explanation are really impressive – Ap31 Jun 03 '17 at 06:27
1

I tested it and found out the following: The vector case had a runtime about 1.8 times longer than the raw array case. But this was only the case when I did not initialize the raw array. After adding a simple loop before the time measurement to initialize all entries with 0.0 the raw array case took as long as the vector case.

It took a closer look and did the following: I did not initialize the raw arrays like

for (size_t i{0}; i < SIZE; ++i)
    a[i] = 0.0;

but did it this way:

for (size_t i{0}; i < SIZE; ++i)
    if (a[i] != 0.0)
    {
        std::cout << "a was set at position " << i << std::endl;
        a[i] = 0.0;
    }

(the other arrays accordingly).
The result was that I got no console output from initializing the arrays and it was again as fast as without initializing at all, that is about 1.8 faster than with the vectors.

When I initialized for example only a "normal" and the other two vector with the if clause I measured a time between the vector runtime and the runtime with all arrays "fake initialized" with the if clause.

Well... that's strange...

Now, i thougt the std::vector has almost no overhead ? What is happening here ? I'd like to use nice STL objects...

Although I cannot explain you this behavior, I can tell you that there is not really an overhead for std::vector if you use it "normal". This is just a very artificial case.

EDIT:

As qPCR4vir and the OP Napseis pointed out this might have to do with optimization. As soon as I turned on optimization the "real init" case was about the already mentioned factor of 1.8 slower. But without it was still about 1.1 times slower.

So I looked at the assembler code but I did not saw any difference in the 'for' loops...

BlameTheBits
  • 850
  • 1
  • 6
  • 22
  • I didn't believe what you wrote, so i tried. This is happening on my side too. If i initialize the c-array, then the openmp loop takes longer. This is plainly un-understandable. Plus, the parallel loop doesn't show anymore gain for the c array! – Napseis Jun 01 '17 at 19:05
  • Something is going on with optimisation. Do again your test, but this time, with -O0 or 1 or 2... – Napseis Jun 01 '17 at 19:16
  • Yep, I am already testing different settings. I think I will make a edit in a few minutes. – BlameTheBits Jun 01 '17 at 19:26
  • I've confirmed by analyzing the assembly that `std::vector` and `std::unique_ptr` are adding *Zero-Fill* to the code, that doesn't exist in the `double *` version. You can see my answer for the details. – Xirema Jun 01 '17 at 20:26
1

The major thing to notice here is the fact that

The array version has undefined behavior

dcl.init #12 states:

If an indeterminate value is produced by an evaluation, the behavior is undefined

And this is exactly what happens in that line:

c[i] = a[i] + b[i];

Both a[i] and b[i] are indeterminate values since the arrays are default-initialized.

The UB perfectly explains the measuring results (whatever they are).

UPD: In the light of @HristoIliev and @Zulan answers I'd like to emphasize language POV once more.

The UB of reading uninitialized memory for the compiler essentialy means that it can always assume that memory is initialized, so whatever the OS does is fine with C++, even if the OS has some specific behavior for that case.

Well it turns out that it does - your code is not reading the physical memory and your measurements correspond to that.

One could say that the resulting program does not compute the sum of two arrays - it computes the sum of two more easily accessible mocks, and it is fine with C++ exactly because of the UB. If it did something else, it would still be perfectly fine.

So in the end you have two programs: one adds up two vectors and the other just does something undefined (from C++ point of view) or something unrelated (from OS point of view). What is the point of measuring their timings and comparing the results?

Fixing the UB solves the whole problem, but more importantly it validates your measurements and allows you to meaningfully compare the results.

Ap31
  • 3,244
  • 1
  • 18
  • 25
  • " since the arrays are default-initialized" I thought they are _not_ initialized at all? I thought it was obvious that not initializing the arrays leads to UB but until Zulan's answer I thought it is just undefined in terms of "the elements of the array could hold any value". – BlameTheBits Jun 02 '17 at 20:40
  • @Shadow default-initializatied means "not initialized" for non-class objects: [dcl.init #7.3](http://eel.is/c++draft/dcl.init#7.3) – Ap31 Jun 03 '17 at 04:42
  • Ah, ok. Thanks for the clarification. But if I get [cppreference.com](http://en.cppreference.com/w/cpp/language/default_initialization) right this was introduced with C++11? "Scalars and POD types with dynamic storage duration were considered to be not initialized (since C++11, this situation was reclassified as a form of default initialization)." – BlameTheBits Jun 03 '17 at 07:13
  • @Shadow oh yeah, they changed the wording, but the underlying behavior did not change - array memory is not initialized, and reading from it is UB – Ap31 Jun 03 '17 at 07:42
-1

In this case, i think the culprit is -funroll-loops, from what i just tests in O2 with and without this option.

https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Optimize-Options.html#Optimize-Options

funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Napseis
  • 833
  • 2
  • 10
  • 24
  • I cannot confirm this. The results are the same for me (at least the ratio) and the assembler code just looks the same with and without the `-funroll-loops` flag. Btw this flag is not included when using `O3` (or lower opt level)... – BlameTheBits Jun 01 '17 at 20:14
  • It does in gcc 5, i corrected the link. What compiler are you using ? – Napseis Jun 01 '17 at 20:20
  • 5.4 too. Maybe I looked wrong. But just adding `-funroll-loops` without any optimization flag did not result in a speedup for me. (neither did it with opt flags) – BlameTheBits Jun 01 '17 at 20:29
  • that is super strange, i confirm that this changes for me, with -O2, for sure. – Napseis Jun 01 '17 at 20:31
  • -funroll-loops usually needs to be accompanied by --param max-unroll-times=4 or the like. The startup time for excessively unrolled loop will become more prominent when parallelized. – tim18 Jun 02 '17 at 02:26
  • I wonder why no one cares whether the loop is simd vectorized. Unless running with multiple memory controllers this should achive full speed. If there are questions such as whether a race condition exists among agents potentially modifying STL parameters such as vector size, appropriate pragma may be needed to optimize. – tim18 Jun 02 '17 at 02:28
  • @tim18 I did run it only in a single thread. And why should one care about SIMD as long as in both versions the loop looks exactly the same? The problem is somehow the initialization and I don't see any relation to wether SIMD is used or not. – BlameTheBits Jun 02 '17 at 07:20
  • It is easily possible that compiling without omp parallel produces simd parallel optimization while omp parallel does not. You don't even say what you are doing to account for the change in simd parallel optimization between -O2 and -O3. – tim18 Jun 02 '17 at 11:22