Initializing Billion Integers to value 1

Question

What is good posix thread design to initialize billion integers using c/c++ on linux platform 8-core CPU with 32GB of DRAM? Thanks for your help.

1 billion == 1G; integers ~~ 4GB (8GB, 16GB?); 32MB is waaaaaaaaaaaaaaaay too little — pmg, Dec 01 '11 at 16:03
Just wondering: If the bottleneck is the memory b/w, I assume that multi-threading might be even slower? Have you checked whether the performance is limited by memory bandwidth or by computation of the initial values? Mhh, well, maybe you can write to multiple different RAM regions concurrently and get even faster results..anyone knows some details? — zerm, Dec 01 '11 at 16:08
@zerm, make that an answer - for such a trivial operation memory bandwidth will certainly be the bottleneck and one thread will be the fastest. — Mark Ransom, Dec 01 '11 at 16:26
Since yesterday I am reading about the very same situation in a book i downloaded ,called "What Every Programmer Should Know About Memory". Haven't finished yet. But maybe it gives you the answer yuo're looking for. — engf-010, Dec 01 '11 at 16:35
@TomasT., `memset` works with bytes, not integers. It can't create the necessary pattern. — Mark Ransom, Dec 01 '11 at 18:54
@Tibor: Why are you asking? IMHO, it is a bad design... Why can't you allocate progressively space for your data??? — Basile Starynkevitch, Dec 01 '11 at 20:41

score 8 · Answer 1 · answered Dec 01 '11 at 16:32

8

This is a trivial operation and you need not consider multi-threading. Just do it with a memcpy in a single thread.

answered Dec 01 '11 at 16:32

David Heffernan

601,492
42
1,072
1,490

2

Since I know integers are contiguous, I can as well use memset() to initialize to 1 in single thread, Thanks David! – Tibor Jiri Dec 01 '11 at 16:35
1

@TiborJiri And how do you expect to initialize integers (other than to 0) with `memset`. Try `std::fill`. – James Kanze Dec 01 '11 at 16:40
@JamesKanze, You're correct that `memset` can't initialize integers, but I suspect `std::fill` isn't the fastest way to go. – Mark Ransom Dec 01 '11 at 16:55
This solution likely offers the best bang per buck. – Brian Cain Dec 01 '11 at 17:09
@MarkRansom If it isn't the fastest solution, then there's something seriously wrong with the library and/or the compiler. – James Kanze Dec 02 '11 at 11:05
en route to learning about/using memcpy, what are some recommendable references in addition to the documentation http://www.cplusplus.com/reference/cstring/memcpy/ – Quetzalcoatl Jun 28 '20 at 19:49
@Quetz There is nothing more to say than what is in the documentation – David Heffernan Jun 28 '20 at 19:52

score 4 · Answer 2 · answered Dec 01 '11 at 16:23

The exact number of threads will not be such a limiting factor, but sometimes for this questions it is worth to overcommit, say use 2 threads per physical core.

But the real bottleneck will be IO, writing the data into the RAM. You'd have to take care that the data that is to be replaced will never read before you erase it. Then you should assure that writes to memory appear in large chunks and (if possible) as "write through", mondern CPU have instructions for the later.

Usually something like memcpy with a fixed sized buffer (some pages) that contains the pattern that you want to see should be optimized quite well.

score 2 · Answer 3 · answered Dec 01 '11 at 16:55

What is that for? Depending on usage, the following scenario might work: you initialize one memory page (that's several KB) to all 1's. Then you map that page into the virtual address space as many times as needed with a copy-on-write flag. This way, on reading you'll get all ones from all those virtual pages, on writing the system will allocate more physical pages as needed.

score 1 · Answer 4 · answered Dec 01 '11 at 16:04

1

Perhaps a divide and conquer algorithm? Partition the memory containing the integers by some number corresponding to the number of threads optimal for your system. Then launch one thread per partition which initializes all of its integers.

answered Dec 01 '11 at 16:04

maerics

151,642
46
269
291

The question is how many threads? are 8 threads sufficient, or 16 or how many? If we use many threads (> 100) I am thinking performance will be adversely impacted on Linux. – Tibor Jiri Dec 01 '11 at 16:08
4

@TiborJiri: the number of threads depends on many factors relating almost exclusively to the target system. Likely you will have to do some testing to find the right value but you can design your algorithm so that the number is a parameter which can be trivially changed. – maerics Dec 01 '11 at 16:09
@Tibor: What maerics said, and you might want to look at http://stackoverflow.com/questions/1715580/how-to-discover-number-of-cores-on-mac-os-x for an answer relating to Macs. It might help you search for an answer relating to your platform. – Mike DeSimone Dec 01 '11 at 16:16
Why the downvote? I realize that a memcpy solution is better but OP specifically asked for a multithreading solution... – maerics Dec 01 '11 at 17:07

score 0 · Answer 5 · answered Dec 01 '11 at 16:46

If you do attempt multithreading, aligning your writes with the native cache line size will likely provide optimal memory throughput. As everyone says, the memory throughput will dominate the performance but there is some portion of CPU time required for these writes. Minimizing that time with multithreading and vectorized instructions may be helpful.

The real answer is to profile your system (since you stated a very specific target, it sounds like you don't want to design a balanced algorithm which is good enough for most targets). Modern CPUs which have access to 32GB of DRAM often have hardware performance counters (Intel's and AMD's do) which make finding out CPU, caching activity pretty easy.

Initializing Billion Integers to value 1

5 Answers5