3

While compiling under Linux I use the flag -j16, as I have 16 cores. I am just wondering if it makes any sense to use something like -j32. Actually this is a question about scheduling of processor time, and whether it is possible to put more pressure on one particular process than on another this way (let's say I have two parallel compilations, each with -j16; what if one of them used -j32 instead?). I think it does not make much sense, but I am not sure, as I do not know how the kernel handles such things.

Kind regards,

Lormitto

3 Answers

8

I use a non-recursive build system based on GNU make and I was wondering how well it scales.

I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option, make ran three times and the shortest time was recorded. Using -j9 gives the shortest compile time, 11% better than -j6.

In other words, hyper-threading does help a little, and an optimal formula for Intel processors with hyper-threading is number_of_cores * 1.5:

(Chart: compile time for -j1 through -j20, with the minimum at -j9.)

Chart data is here.
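A minimal shell sketch of the number_of_cores * 1.5 formula above, assuming GNU coreutils `nproc` is available (it reports logical processors, i.e. twice the physical core count on a hyper-threaded Intel CPU, so physical cores * 1.5 is three quarters of nproc):

# physical_cores * 1.5 == logical_processors * 3/4 with 2 hardware threads per core
make -j"$(( $(nproc) * 3 / 4 ))"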

Maxim Egorushkin
  • Could that have been I/O limited? I'm surprised we barely see any scaling past the number of physical cores in that 6c12t CPU. Or maybe memory bandwidth or cache footprint was a limiting factor. Another answer from 2013 on a 4c8t laptop is fairly consistent, though: [GNU make: should the number of jobs equal the number of CPU cores in a system?](https://stackoverflow.com/q/2499070) shows the wall-clock time barely improves past number of physical cores, and user (CPU) time goes up to about double the 1-core time on a system with lots of memory and a fast SSD. – Peter Cordes Feb 28 '23 at 21:57
  • @PeterCordes It absolutely could, that host rented in a datacentre was indeed running HDDs, IIRC. It had enough RAM, however, to keep all the source files being compiled in the page cache, so that this test was I/O wait free. I made sure the compile times were free of I/O waits to measure CPU scalability. – Maxim Egorushkin Mar 10 '23 at 03:28
  • @PeterCordes It may be highly dependent on the project due to the number of includes per source. The project I benchmarked was small and cognizant of minimizing includes. After that, I worked on projects where compiling a 1,000-line C++ source file included 500,000 lines of header files, which would probably result in different numbers. Hence, the best advice, as always, is to benchmark your specific project. If hyper-threading helps, that's ideal. – Maxim Egorushkin Mar 10 '23 at 03:42
  • Yeah, seems to be a real effect, not just I/O limits. Makes me wonder if memory bandwidth was part of the bottleneck; that's been growing faster than core counts recently on desktop CPUs, unlike servers. Or if compiler internals have more ILP and fewer branch misses than I thought they might. I should probably test some on my Skylake, especially for `-O2` / `-O3` builds, but even that is getting kind of old by modern standards, with about half the memory bandwidth of the fastest DDR5 and a narrower pipeline. (The wider the pipeline, the more SMT can help keep it busy.) – Peter Cordes Mar 10 '23 at 03:45
  • @PeterCordes Toolchains have moved on far since then. The best advice is still the canonical "benchmark your particular project". However, I always try to take advantage of hyper-threading for compute-intensive tasks to squeeze out that potential +25% performance, while minimizing RSS usage to make sure those HT threads don't cause extra L3 cache misses, which would negate the benefit of the extra threads. Sometimes I have to use fewer cores to avoid L3 misses. – Maxim Egorushkin Mar 10 '23 at 03:54
  • @PeterCordes For compute-intensive tasks I prefer throwing all available cores at one and the same dataset, rather than different cores at different datasets. Those modern 64+ core CPUs are amazing, but, again, they still have the same amount of L3 cache per core as their desktop siblings, so throwing 64 compute-intensive tasks at a server 64-core 2.2GHz CPU ends up taking more time than throwing the same 64 tasks as batches of 32 onto a 4.5GHz 32-core desktop CPU. That probably has nothing to do with your comment, but those L3 misses are a limiting factor to mind for me these days. – Maxim Egorushkin Mar 10 '23 at 04:12
  • Yeah, absolutely, with separate compile tasks 2 per physical core, you only have half the per-core L3. But note that `make -j` beyond even the number of logical cores doesn't continue to degrade, even with presumably-rapid context-switching competing even more for L3 capacity. Or maybe 10 ms is long enough for L3 to refill without huge impact between switches. If I was curious enough I'd run some experiments with `perf stat` on `gcc` and/or `clang` myself to see what kind of hit rate it gets with 2MiB of L3 per physical core on my system. – Peter Cordes Mar 10 '23 at 04:31
  • @PeterCordes `perf stat` makes perfect sense. I run varied compute-intensive tasks and use `sudo perf top -Mintel -r10 -m8M -c5000 --delay 3 --sort comm,dso,symbol` for quick hot-spot identification in production systems; high cycle counts per instruction often indicate cache misses. And then focus in with `perf stat`. – Maxim Egorushkin Mar 10 '23 at 04:50
  • or `perf stat -e task-clock,cycles,instructions,branches,branch-misses,mem_load_retired.l3_miss,mem_load_retired.l2_miss` or something like that, since I was curious about the actual L3 hit/miss rate, and how much traffic makes it past the per-core L2 caches. – Peter Cordes Mar 10 '23 at 04:53
  • @PeterCordes Your command is very specific about counters; not every CPU provides them. `perf stat -ddd` seems like a good alternative for laymen like me, does it not? – Maxim Egorushkin Jun 08 '23 at 23:00
  • Yes, `perf stat -ddd` is fairly reasonable, although it measures L1 and L3 (actually last-level) accesses & misses, not L2. But L3 accesses is about the same as L3 misses. It also counts iTLB and dTLB, so it's a lot of events to count, so it has to multiplex the events onto the counters and each one isn't being counted for much of the time, leading to higher errors from extrapolating if the workload isn't doing the same thing the whole time. – Peter Cordes Jun 08 '23 at 23:26
  • @PeterCordes Good point about the limited number of performance counters; 4 counters is what my workstation Zen 3 CPU has. I am particularly interested in dTLB misses for my linear algebra compute tasks. I get +5-15% performance for free just by using `transparent_hugepage=always` on the kernel command line, and squeeze out another several percent by explicitly placing my large datasets into 1GB huge pages shared between worker processes. The latter reduced `stalled-cycles-frontend` by a factor of 2-3x for me. AMD docs state that the 1GB TLB is handled differently, but I cannot find more details. – Maxim Egorushkin Jun 08 '23 at 23:38
  • @PeterCordes Placing code into huge pages or loading a `.so` into huge pages is still a non-trivial task, unfortunately, if I am not mistaken. – Maxim Egorushkin Jun 08 '23 at 23:47
  • @PeterCordes AMD docs state that the 1GB TLB is handled differently, and I can clearly see that they don't get counted in `dTLB-loads` or `dTLB-load-misses`, unlike for the 4k and 2M page sizes. That's why I wonder if I can get more visibility into 1GB TLB stats and how the 1GB TLB is handled. – Maxim Egorushkin Jun 08 '23 at 23:51
  • @PeterCordes Yes, I do these as well for good measure and more https://github.com/max0x7ba/thp-usage/blob/main/thp-always.service.d/thp-always.sh – Maxim Egorushkin Jun 08 '23 at 23:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254008/discussion-between-peter-cordes-and-maxim-egorushkin). – Peter Cordes Jun 09 '23 at 00:00
  • Typo earlier, I meant to say *L3 accesses is about the same as **L2** misses*, so the counters from `perf stat -ddd` do give you some information about L2 hit/miss rate if you look at L1 misses and L3 accesses. – Peter Cordes Jun 09 '23 at 00:02
  • @PeterCordes `libhugetlbfs` copies an executable into the 2MB/1GB pages and runs it from there, provided it has been linked to align sections/segments on these boundaries and reduce the number of used segments to 3. I did that with statically linked executables in the past. I should try applying that to my `.so` loaded into Python, that could do the trick. – Maxim Egorushkin Jun 09 '23 at 00:06
  • https://www.phoronix.com/review/amd-epyc-9754-smt tested SMT on/off on an AMD Bergamo with 128 Zen4c cores. Compile times with clang and GCC were worse with SMT enabled vs. disabled. (With plenty of RAM, sources hot in the disk cache, and already plenty of parallelism in the no-SMT build, since that's a huge number of physical cores.) So your results are consistent with that. Probably having a couple more jobs than cores helps fill gaps. On a bigger system, it might still be better to do something like `n = cores + 3` rather than `n = cores * 1.5`, or `cores*1.1 + 2`, or something that rounds up. – Peter Cordes Jul 27 '23 at 19:55
  • @PeterCordes I'd say Phoronix 2023 results are consistent with my 2013 results, not the other way around :). – Maxim Egorushkin Jul 29 '23 at 15:53
  • @PeterCordes Why would one ever disable SMT in the BIOS (the Phoronix article's advice) instead of `taskset`-ing the application to run on non-SMT cores only? – Maxim Egorushkin Jul 29 '23 at 16:10
  • If the workloads you care about are all bad with SMT (e.g. a server that runs the same workload all day every day), then disabling the other logical cores means half as many cores for the kernel to manage, and stuff like RCU `run_on` never has to wake up the other logical core so has half as much work to do waiting for all cores. The cores can stay in one-thread-active mode permanently. It's also 100% reliable as a benchmarking method. And some software may detect the total number of logical cores when deciding how many threads to start, without checking affinity masks. – Peter Cordes Jul 29 '23 at 18:57
  • If none of those are true, like most desktop use-cases with software you don't have to fight against to use the desired number of cores, then sure leave SMT enabled for the cases where it helps. – Peter Cordes Jul 29 '23 at 18:58
  • @PeterCordes Good points with reliable benchmarking and software incorrectly detecting the number of available cores. – Maxim Egorushkin Jul 29 '23 at 19:07
2

The rule of thumb is to use the number of processors + 1. Hyper-Threading counts, so a quad-core CPU with HT should use -j9.

Setting the value too high is counter-productive. If you do want to speed up compile times, consider ccache to cache compiled objects that do not change between compilations, and distcc to distribute the compilation across several machines.
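As a rough sketch of that rule (assuming GNU coreutils `nproc`, which already counts hyper-threaded logical processors), plus one common way of hooking in ccache by wrapping the compilers; this assumes the makefiles honour the conventional CC/CXX variables:

# logical processors + 1
make -j"$(( $(nproc) + 1 ))"

# reuse unchanged objects between builds by wrapping the compilers with ccache
make -j"$(( $(nproc) + 1 ))" CC="ccache gcc" CXX="ccache g++"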

RJLouro
  • What is the basis of your rule of thumb please? – Maxim Egorushkin Jul 19 '13 at 15:17
  • Many years of running Gentoo Linux. Here's a sample from their documentation: http://en.gentoo-wiki.com/wiki/Portage_tips#Compilation_process_optimization_on_multicore_systems Note that this has nothing to do with Gentoo, it's a make option. – RJLouro Jul 19 '13 at 15:33
  • Do you have any data to support your claim? [My empirical results](http://stackoverflow.com/a/17749621/412080) disagree with your rule of thumb. – Maxim Egorushkin Jul 24 '13 at 08:59
  • Your empirical results do not disagree with what I said and are just an example of one system with one sample of code. Still, the "rule of thumb" (number of cores + 1, double the cores if hyper-threading is enabled) in your case would be -j13, which is only 0.5 seconds slower than the -j9 you point out as optimal. – RJLouro Jul 25 '13 at 14:18
  • Well, my point was that you have to base statements on observations repeatable by other researchers, not _many years of running Gentoo Linux_. And to say whether my empirical results agree or not, I would have to run each test at least 30 times and see whether the differences are statistically significant. Again, such statements would need to be supported by data. – Maxim Egorushkin Jul 25 '13 at 14:42
  • I agree with your statement on including resources, but this is not a thesis, it's a help community website. It's better to include an answer which is 90% complete within 5 seconds of the question being posted than to wait several hours for one that is 100% complete. If my answer were wrong I'd have deleted it, which is not the case. Keeping this discussion going after one week is pointless. This is my last reply on this discussion. – RJLouro Jul 27 '13 at 13:45
2

We have a machine in our shop with the following characteristics:

  • 256-core SPARC Solaris
  • ~64 GB RAM
  • Some of that memory used for a RAM drive for /tmp

Back when it was originally set up, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the RAM drive.

I saw improvements up to -j56. Beyond that my results flatlined, much like Maxim's graph, until somewhere above (roughly) -j75, where performance began to degrade. Running multiple parallel builds, I could push it beyond the apparent cap of -j56.
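(For illustration, a minimal shell sketch of running multiple parallel builds; the two checkout directories are hypothetical copies of the repo on the /tmp RAM drive:)

# two independent build trees, each with its own make process and its own -j56 job pool
make -C /tmp/checkout1 -j56 &
make -C /tmp/checkout2 -j56 &
wait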

The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service -- which was further hampered by anything in the makefiles that either required extra time to parse (e.g., using = instead of :=, which introduces unnecessary delayed evaluation; complex user-defined macros; etc.) or used things like $(shell).

These are the things I've been able to do that have a noticeable impact on build speed:

Use := wherever possible

If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.

Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
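A minimal GNU make sketch of the difference; the $(shell ...) call here is just a stand-in for any expansion that is expensive to repeat:

# Recursive (deferred) assignment: the right-hand side, including the $(shell)
# call, is re-expanded every time $(CFLAGS) is referenced, e.g. once per recipe.
CFLAGS = -O2 $(shell getconf LFS_CFLAGS)

# Simple (immediate) assignment: expanded exactly once, when this line is parsed.
CFLAGS := -O2 $(shell getconf LFS_CFLAGS)

# += keeps the flavour of the preceding assignment, so this append stays immediate.
CFLAGS += -Wall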

If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible

I once got it in my head to create macros of the form:

IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)

This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
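For contrast, a minimal sketch of the immediate-evaluation form described above (the flag and the ${whatever flags} placeholder are just the ones from the earlier example):

# $(call IFCLANG,...) is evaluated once here, at parse time; each compile step
# then substitutes plain, already-expanded text for $(CXXFLAGS).
CXXFLAGS := ${whatever flags} $(call IFCLANG,-fsanitize=undefined)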

An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:

ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif

... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.

Where possible, avoid using custom macros inside recipes

The reasoning should be pretty obvious here after the above; anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread that kicks off new jobs, so any time spent parsing is time make delays kicking off another parallel job.

I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.
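As an illustrative before/after sketch (the `describe` helper and the pattern rule are hypothetical, not taken from the build above):

# Hypothetical "before": a helper macro expanded inside every recipe. Each time
# the rule fires, make re-expands $(call describe,...) -- including the
# $(shell uname -m) inside it -- in the same thread that schedules parallel jobs.
#
#   define describe
#   @echo "building $(1) on $(shell uname -m)"
#   endef
#
#   %.o: %.c
#   	$(call describe,$@)
#   	$(CC) $(CFLAGS) -c -o $@ $<

# "After": hoist the expensive expansion out so it happens once at parse time;
# the recipe now only substitutes already-expanded text.
ARCH := $(shell uname -m)

%.o: %.c
	@echo "building $@ on $(ARCH)"
	$(CC) $(CFLAGS) -c -o $@ $<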

Brian Vandenberg
  • Interesting observations. Ninja's major premise was to remove all those expansions and shell steps that always happen before make even starts to detect out-of-date targets that need to be recompiled. However, I have yet to see ninja outperform a non-recursive make written using best practices. A good test is how long it takes to build nothing when every target is up-to-date. My own non-recursive make takes under 1 second for a C++ project with over 10,000 source files. – Maxim Egorushkin Mar 04 '16 at 22:43