7

I have an enormous array:

int* arr = new int[BIGNUMBER];

How do I fill it with one value really fast? Normally I would do

for(int i = 0; i < BIGNUMBER; i++)
    arr[i] = 1;

but I think it would take too long.

Can I use memcpy or similar?

Yoda
  • 17,363
  • 67
  • 204
  • 344
  • 4
    Try to avoid assumptions about how long something will take to run. – Vaughn Cato May 05 '13 at 14:23
  • 2
    Have you actually tried to do that? How long does it take? – nico May 05 '13 at 14:24
  • What would `BIGNUMBER` typically be? – Skurmedel May 05 '13 at 14:24
  • If that's your bottleneck, then your code is structured wrong. I'm guessing you're filling the array for a reason, so you must have another (likely slower) loop after. Anyway `memcpy` is likely to be slower because it needs to do memory lookups instead of using a register / constant. – Dave May 05 '13 at 14:34
  • FYI: I tried a memcpy based algorithm and didn't get any meaningful difference in speed. – Vaughn Cato May 05 '13 at 15:02
  • Your real bottleneck is most likely memory bandwidth. In that case, optimizing for CPU time doesn't really buy you anything. – Vaughn Cato May 05 '13 at 15:08
  • One thing that could make a difference (good or bad) is using non-temporal stores. That's something that gcc won't do for you (though Intel's compiler might, IIRC). – Marc Glisse May 05 '13 at 15:28
  • The fastest way may actually be **NOT** to initialize the array up front, but to use the OS paging mechanism. E.g. fill one page, and use Copy On Write for the other pages. – MSalters May 06 '13 at 08:34

8 Answers

15

You could try using the standard function std::uninitialized_fill_n:

#include <memory>

// ...

std::uninitialized_fill_n(arr, BIGNUMBER, 1);

In any case, when it comes to performance, the rule is to always make measurements to back up your assumptions - especially if you are going to abandon a clear, simple design to embrace a more complex one because of an alleged performance improvement.

EDIT:

Notice that - as Benjamin Lindley mentioned in the comments - for trivial types std::uninitialized_fill_n does not bring any advantage over the more obvious std::fill_n. The advantage would exist for non-trivial types, since std::uninitialized_fill_n would allow you to allocate a memory region and then construct objects in place.
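For the int case in the question, a minimal fill_n sketch (using the OP's arr and BIGNUMBER) would look like this:

#include <algorithm>

// ...

std::fill_n(arr, BIGNUMBER, 1);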

However, one should not fall into the trap of calling std::uninitialized_fill_n for a memory region that is not uninitialized. The following, for instance, would give undefined behavior:

my_object* v = new my_object[BIGNUMBER];
std::uninitialized_fill_n(v, BIGNUMBER, my_object(42)); // UB!
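
For completeness, a sketch of the intended usage on genuinely raw storage, obtained here via operator new[] (my_object is the same illustrative type as above):

void* raw = operator new[](BIGNUMBER * sizeof(my_object));
my_object* v = static_cast<my_object*>(raw);
std::uninitialized_fill_n(v, BIGNUMBER, my_object(42)); // OK: no objects lived here yet
// when done: destroy the objects, then release the storage with operator delete[](raw)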
Andy Prowl
  • 124,023
  • 23
  • 387
  • 451
  • Is there any way to do it with memcpy or something like that? – Yoda May 05 '13 at 14:26
  • @RobertKilar: I guess you could try it in this case, yes – Andy Prowl May 05 '13 at 14:28
  • Do you know how to use memcpy for this? Because all I see is how to copy one array to another. – Yoda May 05 '13 at 14:30
  • 1
    @RobertKilar: Try using the `std::uninitialized_fill_n()` function before, and check if it is good enough for you. Most likely that function ends up invoking `memcpy()` for trivial types, and everything gets inlined. – Andy Prowl May 05 '13 at 14:30
  • 2
    What does `uninitialized_fill_n` offer over the more obvious `fill_n` in this case? – Benjamin Lindley May 05 '13 at 14:39
  • @Benjamin Lindley: a snippet from an online resource (cplusplus.com) explains pretty well -- it avoids temporary alloc and copy for more complicated objects: "Unlike algorithm fill_n, uninitialized_fill_n constructs the objects in-place, instead of just copying the value to them. This allows to obtain fully constructed copies into a range of uninitialized memory, such as a memory block obtained by a call to get_temporary_buffer or malloc." – leander May 05 '13 at 14:45
  • 1
    @BenjaminLindley: Indeed, nothing for trivial types - except that the name makes it clearer that an uninitialized memory region will be filled. – Andy Prowl May 05 '13 at 14:46
  • @leander: Yes, but we're dealing with `int`s here. This is why Benjamin is objecting to the use of `std::uninitialized_fill_n`, I believe – Andy Prowl May 05 '13 at 14:46
  • 2
    I tested both the OP solution and `std::uninitialized_fill` with `BIGNUMBER= 1<<31` on Linux with `gcc-4.7 -O3` and there is no significant difference in performance. – Andrew Tomazos May 05 '13 at 14:49
  • 2
    For non trivial types though, it would be undefined behavior, because the memory would already be initialized with valid objects. You would have to use `fill_n`. That's my (small) objection, it's more consistent for all types. But I understand your reasoning too. – Benjamin Lindley May 05 '13 at 14:49
  • @BenjaminLindley: Yes, that's true. I tried to expand the answer a bit, thank you for contributing – Andy Prowl May 05 '13 at 14:59
4

An alternative to a raw dynamic array is std::vector<int>, using the constructor that accepts an initial value for each element:

std::vector<int> v(BIGNUMBER, 1); // 'BIGNUMBER' elements, all with value 1.

As already stated, performance would need to be measured. This approach provides the additional benefit that the memory will be freed automatically.
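
If the vector already exists and has to be refilled, assign does the same job without constructing a new container, and data() (C++11) gives a raw pointer for APIs that expect int*; a small sketch:

v.assign(BIGNUMBER, 1); // BIGNUMBER elements, all set to 1
int* p = v.data();      // contiguous storage, usable wherever an int* is expected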

hmjd
  • 120,187
  • 20
  • 207
  • 252
  • but it cannot be faster than the OP's solution since it internally has to do the same tasks: first allocate the memory, then fill it with the default value – vlad_tepesch May 05 '13 at 16:50
2

Some possible alternatives to Andy Prowl's std::uninitialized_fill_n() solution, just for posterity:

  • If you are lucky and your value is composed of all the same bytes, memset will do the trick (a small sketch follows this list).
  • Some implementations offer a 16-bit version memsetw, but that's not everywhere.
  • GCC has an extension for Designated Initializers that can fill ranges.
  • I've worked with a few ARM systems that had libraries that had accelerated CPU and DMA variants of word-fill, hand coded in assembly -- you might look and see if your platform offers any of this, if you aren't terribly concerned about portability.
  • Depending on your processor, even looking into loops around SIMD intrinsics may provide a boost; some of the SIMD units have load/store pipelines that are optimized for moving data around like this. On the other hand you may take severe penalties for moving between register types.
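
As a small sketch of the first bullet (using the OP's arr and BIGNUMBER), memset helps only for values such as 0 or -1 whose int representation repeats a single byte:

#include <cstring>

// ...

std::memset(arr, 0, BIGNUMBER * sizeof(int));    // every int becomes 0
std::memset(arr, 0xFF, BIGNUMBER * sizeof(int)); // every int becomes -1 (all bits set)
// memset(arr, 1, ...) would NOT give you 1s: each int would hold 0x01010101 = 16843009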

Last but definitely not least, to echo some of the commenters: you should test and see. Compilers tend to be pretty good at recognizing and optimizing patterns like this -- with anything other than the simple loop or uninitialized_fill_n, you are probably just trading away portability or readability.


leander
  • 8,527
  • 1
  • 30
  • 43
  • 1
    `memset` has to deal with non-aligned writes, both at the begin and the end of the range. `std::fill_n` doesn't. A smart implementation of `memset` might in fact do up to 3 byte writes, then call `std::fill_n(aligned_ptr, value * 0x01010101U, n/4)`, and then finish with the last unaligned bytes. – MSalters May 06 '13 at 08:38
  • @MSalters: definitely true, but I've had some strange experiences here. For example, on one platform I've worked with, the per-byte copy (annoyingly not memcpy) and memset both had a huge amount of hand-optimized code: branch targets for different sizes, cacheline awareness, etc. The copy a word at a time was _slightly_ optimized, making use of all the available ARM registers, so it would read approximately 32 bytes at a time and write 32 bytes at a time, but it was not cacheline aware. It had a much lower icache footprint, though. – leander May 06 '13 at 14:30
  • @MSalters: the vast majority of the time, the "8-bit" copy was a lot faster, despite/because of all the branching and prepwork to find the beginning of a cacheline, deal with short copies, etc... Every once in a while, due to icache pollution or particularly small copy amounts, the word copy would win out. – leander May 06 '13 at 14:31
  • I'm not sure any of this makes a difference, and certainly dabbles on the maintainability/portability versus speed line, but I do like to point out possible alternatives when they exist. =) – leander May 06 '13 at 14:34
  • ARM, of all CPU's, has a faster byte copy?! A design where byte writes are basically read-modify-write operations on words? I'm surprised. – MSalters May 06 '13 at 14:39
  • Totally software - compiler/libs. I wrote custom asm on words that knew about cacheline align (for image manip) and it outperformed both. :) Also we were on a platform with ludicrously fast ram, albeit with only 16 bit bus. – leander May 06 '13 at 16:47
  • It was purely that someone had written all the extra cases for the copy8, where they didn't for copy32, so once it got past lead-in it could go a lot faster... To be strange the copy16 had none of the optimizations of either, and was just a tight single-read single-write loop. I guess the point is, as usual, disassemble and profile whenever anything is suspect -- but I've seen the pattern of the C funcs being hand tuned and the C++ side neglected more often than I'd like. – leander May 06 '13 at 16:48
1

Under Linux/x86 gcc with optimizations turned on, your code will compile to the following:

rax = arr
rdi = BIGNUMBER

400690: c7 04 90 01 00 00 00    movl   $0x1,(%rax,%rdx,4)

Store the immediate int 1 to memory at address rax + rdx*4

400697: 48 83 c2 01             add    $0x1,%rdx

Increment register rdx

40069b: 48 39 fa                cmp    %rdi,%rdx

Compare rdx (the counter) with rdi (BIGNUMBER)

40069e: 75 f0                   jne    400690 <main+0xa0>

If BIGNUMBER has not been reached yet, jump back to the start.

It takes about 1 second per gigabyte on my machine, but most of that I bet is paging in physical memory to back the uninitialized allocation.
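
For reference, a rough way to reproduce that measurement with std::chrono; this is only a sketch, not part of the original answer, and the first touch of each page is deliberately included in the timed region:

#include <chrono>
#include <cstdio>

// ...

auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < BIGNUMBER; i++)
    arr[i] = 1;
auto t1 = std::chrono::steady_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
std::printf("fill took %lld ms\n", (long long)ms);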

Andrew Tomazos
  • 66,139
  • 40
  • 186
  • 319
1

Just unroll the loop by, say, 8 or 16 times. Functions like memcpy are fast, but they're really there for convenience, not to be faster than anything you could possibly write:

int i;
for (i = 0; i + 8 <= BIGNUMBER; i += 8){
  a[i+0] = 1; // the test against BIGNUMBER and the increment now happen once per 8 stores
  a[i+1] = 1; // the compiler should see that a[i+k] is just offset addressing off a[i]
  a[i+2] = 1;
  a[i+3] = 1;
  a[i+4] = 1;
  a[i+5] = 1;
  a[i+6] = 1;
  a[i+7] = 1;
}
for (; i < BIGNUMBER; i++) a[i] = 1; // take care of the remaining 0-7 elements

The compiler might be able to unroll the loop for you, but why take the chance?

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • @Marc: How do you figure that? – Mike Dunlavey May 05 '13 at 19:30
  • 1
    Compilers come with a piece of optimization called the vectorizer. It can recognize loops that do the same thing on all items of an array, generate a prolog/epilog to handle alignment, and generate vector instructions for the main loop. If a is not (known to be) well aligned, your unrolling just made it very hard to vectorize (or maybe it will vectorize using slower unaligned writes). The compiler can also unroll loops by the right amount depending on the target cpu. Plus this is less readable for humans ;-) – Marc Glisse May 05 '13 at 21:17
  • I would say look at the assembly first; before you massage your code hoping it will get vectorized. Last time I tried, it was easier just vectorizing myself, GCC refused to play along that particular time. – Skurmedel May 05 '13 at 23:06
  • @Marc: Maybe there are some programmers who can't read that. I hope not ;-) But your point is well taken. It's always worth looking at the assembly language the compiler generates, and teasing it into generating good stuff. Thanks. – Mike Dunlavey May 06 '13 at 20:44
-2

Use memset or memcpy

memset(arr, 0, BIGNUMBER * sizeof(int));
Thang Do
  • 316
  • 2
  • 16
  • 3
    `memset` will not work for non-zero numbers, because it works on a byte level. `memcpy` is probably going to be slower than a manual loop, because it will need to keep referencing other memory as well as writing – Dave May 05 '13 at 14:32
  • @Dave: you mean non-byte values? `memset` definitely works with arbitrary unsigned chars. – user268396 May 05 '13 at 14:37
  • @Dave also memcpy will be slower because you will have to copy from somewhere, meaning that you would have to fill an array with 1's and then use memcpy on that. That would seem silly next to just filling up your array with 1's to begin with. – SirGuy May 05 '13 at 14:39
  • @user268396 No, I mean `memset(arr,1,BIGNUMBER)` will fill the array with 16843009 instead of 1. – Dave May 05 '13 at 14:43
  • @Dave: that is a subtle type issue when reading back out, because of the fact an `int` is made up of more bytes than one `unsigned char`. However as it was, your comment seemed to imply that `memset(arr, 0xA, BIGNUMBER)` would fail to fill the array with `AAAA` (assuming 32bit ints). Which most definitely does work. ;) – user268396 May 05 '13 at 15:20
-2

Try using memset?

memset(arr, 1, BIGNUMBER);

http://www.cplusplus.com/reference/cstring/memset/

  • `memset` will not work for non-zero numbers, because it works on a byte level. – Dave May 05 '13 at 14:34
  • Then you must have been using it on `char` arrays. On an `int` array (as in the question), it will produce wrong results. e.g. `memset(arr,1,...)` will set the numbers to 16843009 (assuming a 32-bit `int`). For a float array, the results will be even less predictable. – Dave May 21 '13 at 10:39
-2

memset(arr, 1, sizeof(int) * BIGNUMBER);

  • 2
    **this is not the same**: memset sets every byte to the given value (which is treated as uint8_t), so every int in your array has the value 16,843,009 after that operation – vlad_tepesch May 05 '13 at 16:53