109

There seems to be some controversy on whether the number of jobs in GNU make is supposed to be equal to the number of cores, or if you can optimize the build time by adding one extra job that can be queued up while the others "work".

Is it better to use -j4 or -j5 on a quad core system?

Have you seen (or done) any benchmarking that supports one or the other?

Smi
Johan
  • Just for the tip, you can use `make -j\`nproc\`` to make a CPU-independent script :) – VivienG Dec 18 '15 at 13:57
  • If you have a mix of recipes that are io-bound and cpu-bound, then you're potentially going to want many more than NCPUs. Consider also adding -lX options. This isn't really an answerable question, other than "it depends on your hardware and make tasks." – James Moore Jul 10 '17 at 17:48
  • It is technically possible to see an improvement. You need a slow disk, not enough ram and lots of small source code files. Easier to come by a decade ago. – Hans Passant Nov 10 '18 at 22:44
  • Crosslink: same question on UNIX&Linux SE [linux - How to determine the maximum number to pass to make -j option? - Unix & Linux Stack Exchange](https://unix.stackexchange.com/questions/208568/how-to-determine-the-maximum-number-to-pass-to-make-j-option) – user202729 May 19 '23 at 03:16

10 Answers

65

I've run my home project on my 4-core laptop with hyperthreading and recorded the results. This is a fairly compiler-heavy project, but it ends with a unit test that takes 17.7 seconds. The compiles are not very I/O intensive; there is plenty of memory available, and anything that doesn't fit is on a fast SSD.

1 job        real   2m27.929s    user   2m11.352s    sys    0m11.964s    
2 jobs       real   1m22.901s    user   2m13.800s    sys    0m9.532s
3 jobs       real   1m6.434s     user   2m29.024s    sys    0m10.532s
4 jobs       real   0m59.847s    user   2m50.336s    sys    0m12.656s
5 jobs       real   0m58.657s    user   3m24.384s    sys    0m14.112s
6 jobs       real   0m57.100s    user   3m51.776s    sys    0m16.128s
7 jobs       real   0m56.304s    user   4m15.500s    sys    0m16.992s
8 jobs       real   0m53.513s    user   4m38.456s    sys    0m17.724s
9 jobs       real   0m53.371s    user   4m37.344s    sys    0m17.676s
10 jobs      real   0m53.350s    user   4m37.384s    sys    0m17.752s
11 jobs      real   0m53.834s    user   4m43.644s    sys    0m18.568s
12 jobs      real   0m52.187s    user   4m32.400s    sys    0m17.476s
13 jobs      real   0m53.834s    user   4m40.900s    sys    0m17.660s
14 jobs      real   0m53.901s    user   4m37.076s    sys    0m17.408s
15 jobs      real   0m55.975s    user   4m43.588s    sys    0m18.504s
16 jobs      real   0m53.764s    user   4m40.856s    sys    0m18.244s
inf jobs     real   0m51.812s    user   4m21.200s    sys    0m16.812s

Basic results:

  • Scaling up to the core count increases performance nearly linearly. The real time went down from about 2.5 minutes to 1.0 minute (2.5x as fast), while the user (CPU) time only went up from 2:11 to 2:50. System time barely increased in this range.
  • Scaling from the core count to the thread count increased the user time immensely, from 2:50 to 4:38. This near doubling is most likely because the extra compiler instances were competing for the same CPU resources. The system gets a bit more loaded with requests and task switching, pushing system time up to 17.7 seconds. The advantage is about 6.5 seconds on a compile time of 53.5 seconds, making for a 12% speedup.
  • Scaling from the thread count to double the thread count gave no significant speedup. The times at 12 and 15 jobs are most likely statistical anomalies that you can disregard. The total time taken increases ever so slightly, as does the system time; both are most likely due to increased task switching. There is no benefit to this.

My guess right now: if you do something else on your computer, use the core count. If you do not, use the thread count. Exceeding it shows no benefit. At some point the jobs will become memory-limited and collapse because of that, making compiling much slower. The "inf" line was added at a much later date, which makes me suspect there was some thermal throttling for the 8+ job runs. This does show that for this project size there's no memory or throughput limit in effect. It's a small project though, given 8 GB of memory to compile in.
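
For reference, a rough way to see both numbers on Linux (a minimal sketch, assuming nproc and lscpu are available):

# nproc reports logical CPUs (threads); the lscpu pipeline counts unique
# (core, socket) pairs, i.e. physical cores.
threads=$(nproc)
cores=$(lscpu --parse=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
make -j"$cores"    # or make -j"$threads" if the machine is otherwise idle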

dascandy
  • According to https://stackoverflow.com/questions/56272639/400-threads-in-20-processes-outperform-400-threads-in-4-processes-while-performi/56273407#56273407, you can get an advantage running more tasks than you have CPUs but only if your tasks spend a significant share of time waiting for network I/O. For compilation tasks, this is not the case though. – ivan_pozdeev Jul 23 '19 at 21:49
  • I would recommend total count of available hardware threads + 1 to allow I/O to overlap CPU usage. If the system is not dedicated (e.g. you're using a desktop with web browser) then you want smaller number as suggested in the answer. – Mikko Rantalainen Aug 16 '22 at 19:41
  • When you say 4 core with hyperthreading, you mean 4 physical cores, so 4c8t? Yeah, that would be consistent with your "user" time going up to twice the 1-core total, if hyperthreading didn't give any speedup. What CPU model, if you still remember? Older CPUs may benefit less from hyperthreading for some workloads, since they're not as wide and have smaller caches. – Peter Cordes Feb 28 '23 at 21:56
  • In 2013 the terminology for 4c8t was 4 cores with hyperthreading, so yes. Efficiency cores hadn't been invented yet. It was an Intel Core i5-520M. It did have a 2x 8-way associative data cache, so that would slightly limit throughput on 8 threads. I should rerun this benchmark, to get new numbers and because of the 10-year anniversary of the comment :D – dascandy Feb 28 '23 at 22:06
  • Thanks. That's a Nehalem; the generation before Sandybridge introduced a uop cache and other improvements. [How to speed up compilation time in linux](https://stackoverflow.com/q/17743547) has some timings from an i7-3930K (Sandybridge-E). But actually your 10% speedup (`(59.8-53.5) / 59.8`) from `-j4` to `-j8` is about the same as Maxim's 12% from `-j6` to `-j9` which proved to be the sweet spot on his 6c12t machine. – Peter Cordes Feb 28 '23 at 22:30
  • I'd expected compilers to benefit more from HT; maybe they branch mispredict and cache miss a lot less than I thought. (Or maybe competition for BP and cache makes things worse enough to offset.) https://openbenchmarking.org/test/pts/build-gcc shows a 16c32t Ryzen 9 7950X (490 sec) with a big lead over a 16c24t Alder Lake i9-12900K (616 sec), but all of the Zen4's cores are big cores, none efficiency cores. OTOH, more of that Alder Lake's threads have a whole E-core to themselves... I'd need to analyze more data more carefully, like two Alder Lakes with the same P-cores but more E-cores. – Peter Cordes Feb 28 '23 at 22:39
  • https://www.phoronix.com/review/amd-epyc-9754-smt/6 tested SMT on/off on an AMD Bergamo with 128 Zen4c cores. Compile times with clang and GCC were worse with SMT enabled vs. disabled. (With plenty of RAM, and sources hot in disk cache, and already plenty of parallelism in the no-SMT build, since that's a huge number of physical cores.) – Peter Cordes Jul 27 '23 at 19:57
61

I would say the best thing to do is benchmark it yourself on your particular environment and workload. Seems like there are too many variables (size/number of source files, available memory, disk caching, whether your source directory & system headers are located on different disks, etc.) for a one-size-fits-all answer.

My personal experience (on a 2-core MacBook Pro) is that -j2 is significantly faster than -j1, but beyond that (-j3, -j4 etc.) there's no measurable speedup. So for my environment "jobs == number of cores" seems to be a good answer. (YMMV)
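
Something like the following is enough to get started (a rough sketch; it assumes a POSIX shell and that your Makefile has a clean target):

for j in 1 2 3 4 5 8; do
    make clean >/dev/null
    echo "=== -j$j ==="
    /usr/bin/time -p make -j"$j" >/dev/null
done

Compare the "real" times, and run each setting more than once if the numbers look noisy.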

David Gelhar
32

I, personally, use make -j n where n is "number of cores" + 1.

I can't, however, give a scientific explanation: I've seen a lot of people using the same settings and they gave me pretty good results so far.

Anyway, you have to be careful because some make-chains simply aren't compatible with the --jobs option, and can lead to unexpected results. If you're experiencing strange dependency errors, just try to make without --jobs.
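
If you don't want to hard-code the number, a minimal sketch for Linux (assuming GNU coreutils' nproc; note that nproc reports logical CPUs, so on a hyperthreaded machine this is threads + 1 rather than physical cores + 1):

# "number of cores + 1", computed at invocation time
make -j"$(( $(nproc) + 1 ))"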

s g
ereOn
  • The explanation (can't vouch for its scientificness though) is that "+ 1" gives an extra job that runs while any one of the other n jobs is doing I/O. – Laurynas Biveinis Mar 24 '10 at 11:56
  • @LaurynasBiveinis: But then the jobs are running on different cores all the time, at least more often than with a more conservative setting where a job is given the chance to stay on the same core for a longer period of time. There are pros and cons here... – krlmlr Jun 08 '12 at 08:47
  • Number-of-cores + 1 is my default setting too. One issue is that, in any reasonably large system, make seems to delay linking and do _all_ the link steps together. At this point you run out of RAM. Bah! – bobbogo Jan 16 '14 at 22:04
  • some make-chains simply aren't compatible with the --jobs option -> This means you've got missing dependencies. Fix your makefiles if you ever get this. – dascandy Sep 27 '15 at 18:03
  • The idea with +1 is to try to start one process which will stall to complete I/O and while waiting for the I/O another (already having completed I/O) can be run on the CPU. If your CPU is hugely faster than your I/O devices and you have enough RAM, then setting `-j` to much higher than core count might make sense, too, because then I/O would be the bottleneck for the whole time. – Mikko Rantalainen Aug 16 '22 at 19:46
12

Neither is wrong. To be at peace with yourself and with the author of the software you're compiling (different multi-thread/single-thread restrictions apply at the software level itself), I suggest you use:

make -j`nproc`

Notes: nproc is a Linux command that returns the number of cores/threads (on modern CPUs) available on the system. Placing it in backticks as above substitutes its output into the make command.

Additional info: As someone mentioned, using all cores/threads to compile software can literally choke your box to near death (making it unresponsive) and might even take longer than using fewer cores. One Slackware user here posted that he had a dual-core CPU but still tested up to -j 8, and the results stopped improving at -j 2 (only 2 hardware cores that the CPU can utilize). So, to avoid an unresponsive box, I suggest you run it like this:

make -j`nproc --ignore=2`

This passes the output of nproc, minus 2 cores, to make.
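
If your coreutils is too old for --ignore, a rough equivalent that also makes sure the job count never drops below 1 (the clamping is my own addition, not part of nproc):

# Subtract 2 from the logical CPU count, but never go below one job.
jobs=$(( $(nproc) - 2 ))
[ "$jobs" -lt 1 ] && jobs=1
make -j"$jobs"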

Digital Lucifer
7

Ultimately, you'll have to do some benchmarks to determine the best number to use for your build, but remember that the CPU isn't the only resource that matters!

If you've got a build that relies heavily on the disk, for example, then spawning lots of jobs on a multicore system might actually be slower, as the disk will have to do extra work moving the disk head back and forth to serve all the different jobs (depending on lots of factors, like how well the OS handles the disk-cache, native command queuing support by the disk, etc.).

And then you've got "real" cores versus hyper-threading. You may or may not benefit from spawning jobs for each hyper-thread. Again, you'll have to benchmark to find out.

I can't say I've specifically tried #cores + 1, but on our systems (Intel i7 940, 4 hyperthreaded cores, lots of RAM, and VelociRaptor drives) and our build (large-scale C++ build that's alternately CPU and I/O bound) there is very little difference between -j4 and -j8. (It's maybe 15% better... but nowhere near twice as good.)

If I'm going away for lunch, I'll use -j8, but if I want to use my system for anything else while it's building, I'll use a lower number. :)
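
A related knob, mentioned in other comments on this question, is GNU make's load-average limit. A hedged sketch of combining it with -j so the build backs off while the machine is busy (assuming GNU make and nproc):

# -l / --max-load: don't start new jobs while the load average
# exceeds the given value, here set to the logical CPU count.
make -j"$(nproc)" -l"$(nproc)"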

ijprest
  • Seems great, but I'm confused why you wouldn't just take that +15% every time by using `-j 8` – s g May 20 '14 at 19:19
  • @sg: j8 was really taxing on the system I described in my original post... the machine was still *usable*, but it was definitely less responsive. So if I still wanted to use it interactively for other tasks (typically working on other code, and maybe the occasional single-DLL build), I would reserve a couple of cores for the interactive bits. – ijprest May 21 '14 at 00:22
  • @sg: This is less of a problem on our newer systems... I suspect it's mostly because we're running SSDs now. (I think we're entirely CPU-bound now that we're going to SSDs... we tried building entirely on a RAM-drive with almost no improvement.) But I will still leave a couple of cores free if I'm doing anything more than simple text editing in the foreground. – ijprest May 21 '14 at 00:24
5

I just got an Athlon II X2 Regor proc with a Foxconn M/B and 4GB of G-Skill memory.

I put my 'cat /proc/cpuinfo' and 'free' at the end of this so others can see my specs. It's a dual core Athlon II x2 with 4GB of RAM.

uname -a on default slackware 14.0 kernel is 3.2.45.

I downloaded the next step kernel source (linux-3.2.46) to /archive4;

extracted it (tar -xjvf linux-3.2.46.tar.bz2);

cd'd into the directory (cd linux-3.2.46);

and copied the default kernel's config over (cp /usr/src/linux/.config .);

used make oldconfig to prepare the 3.2.46 kernel config;

then ran make with various incantations of -jX.

I tested the timings of each run by issuing make after the time command, e.g., 'time make -j2'. Between each run I 'rm -rf'd the linux-3.2.46 tree, re-extracted it, copied the default /usr/src/linux/.config into the directory, ran make oldconfig, and then did my 'make -jX' test again.
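
Collapsed into a script, the procedure looks roughly like this (a sketch of the steps just described, with the job count passed as the first argument; paths and version are the ones used here):

#!/bin/sh
# Re-run one timing: wipe the tree, re-extract, restore the config, then build.
cd /archive4
rm -rf linux-3.2.46
tar -xjvf linux-3.2.46.tar.bz2
cd linux-3.2.46
cp /usr/src/linux/.config .
make oldconfig
time make -j"$1"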

plain "make":

real    51m47.510s
user    47m52.228s
sys     3m44.985s
bob@Moses:/archive4/linux-3.2.46$

as above but with make -j2

real    27m3.194s
user    48m5.135s
sys     3m39.431s
bob@Moses:/archive4/linux-3.2.46$

as above but with make -j3

real    27m30.203s
user    48m43.821s
sys     3m42.309s
bob@Moses:/archive4/linux-3.2.46$

as above but with make -j4

real    27m32.023s
user    49m18.328s
sys     3m43.765s
bob@Moses:/archive4/linux-3.2.46$

as above but with make -j8

real    28m28.112s
user    50m34.445s
sys     3m49.877s
bob@Moses:/archive4/linux-3.2.46$

'cat /proc/cpuinfo' yields:

bob@Moses:/archive4$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 6
model name      : AMD Athlon(tm) II X2 270 Processor
stepping        : 3
microcode       : 0x10000c8
cpu MHz         : 3399.957
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips        : 6799.91
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 6
model name      : AMD Athlon(tm) II X2 270 Processor
stepping        : 3
microcode       : 0x10000c8
cpu MHz         : 3399.957
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips        : 6799.94
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

'free' yields:

bob@Moses:/archive4$ free
             total       used       free     shared    buffers     cached
Mem:       3991304    3834564     156740          0     519220    2515308
Flexo
sloMoses
  • What does just `make -j` do on that system? Make is supposed to check the load and scale the number of processes based on load. – docwhat Oct 21 '13 at 16:38
  • `make -j` doesn't limit the number of jobs at all. This is usually disastrous on a medium- or large-sized project as quickly more jobs are forked than can be supported by RAM. The option you need to restrict by load is `-l [load]`, in conjunction with `-j` – Matt Godbolt May 05 '17 at 22:34
3

Just as a ref:

From the Spawning Multiple Build Jobs section in LKD (Linux Kernel Development):

where n is the number of jobs to spawn. Usual practice is to spawn one or two jobs per processor. For example, on a dual processor machine, one might do

$ make -j4

Nan Xiao
3

Many years later, the majority of these answers are still correct. However, there has been a bit of a change: using more jobs than you have physical cores now gives a genuinely significant speedup. As an addendum to dascandy's table, here are my times for compiling a project on an AMD Ryzen 5 3600X on Linux. (The Powder Toy, commit c6f653ac3cef03acfbc44e8f29f11e1b301f1ca2)

I recommend checking for yourself, but I've found, with input from others, that using your logical core count for the job count works well on Zen. Alongside that, the system does not seem to lose responsiveness. I imagine this applies to recent Intel CPUs as well. Note that I have an SSD, so it may be worth testing on your own hardware.

scons -j1 --release --native  120.68s user 9.78s system 99% cpu 2:10.60 total
scons -j2 --release --native  122.96s user 9.59s system 197% cpu 1:07.15 total
scons -j3 --release --native  125.62s user 9.75s system 292% cpu 46.291 total
scons -j4 --release --native  128.26s user 10.41s system 385% cpu 35.971 total
scons -j5 --release --native  133.73s user 10.33s system 476% cpu 30.241 total
scons -j6 --release --native  144.10s user 11.24s system 564% cpu 27.510 total
scons -j7 --release --native  153.64s user 11.61s system 653% cpu 25.297 total
scons -j8 --release --native  161.91s user 12.04s system 742% cpu 23.440 total
scons -j9 --release --native  169.09s user 12.38s system 827% cpu 21.923 total
scons -j10 --release --native  176.63s user 12.70s system 910% cpu 20.788 total
scons -j11 --release --native  184.57s user 13.18s system 989% cpu 19.976 total
scons -j12 --release --native  192.13s user 14.33s system 1055% cpu 19.553 total
scons -j13 --release --native  193.27s user 14.01s system 1052% cpu 19.698 total
scons -j14 --release --native  193.62s user 13.85s system 1076% cpu 19.270 total
scons -j15 --release --native  195.20s user 13.53s system 1056% cpu 19.755 total
scons -j16 --release --native  195.11s user 13.81s system 1060% cpu 19.692 total
( -jinf test not included, as it is not supported by scons.)

Tests done on Ubuntu 19.10 w/ a Ryzen 5 3600X, Samsung 860 Evo SSD (SATA), and 32GB RAM. This is a 6-core 12-thread Zen 2 CPU, 2x 16MiB of L3 cache across two CCXs. (3 cores per CCX sharing a 16MiB L3.)

With jobs = 6, build time is 27.5s. Speed keeps improving right up to 12 jobs, as many as there are logical cores, although 11 was almost as fast. 27.51 / 19.553 is a 1.4x speedup for 12 jobs vs. 6. As expected, the results plateau beyond that.

Final note: Other people with a 3600X may get better times than me. When doing this test, I had Eco mode enabled, reducing the CPU's speed a little.


Related:

https://www.phoronix.com/review/amd-epyc-9754-smt/6 tested SMT on/off on an AMD Bergamo with 128 Zen4c cores. Compile times with clang and GCC were worse with SMT enabled vs. disabled. (With plenty of RAM, and sources hot in disk cache, and already plenty of parallelism in the no-SMT build, since that's a huge number of physical cores. Perhaps even compared to the number of source files in some directories.)

Zen 4C cores have less L3 cache per core (16 cores per CCX instead of 8), potentially hurting SMT worse than on normal CPUs.

Also, CPU frequency may have been power-limited in that EPYC, so perhaps CPU frequencies were higher without SMT. Running in "Eco" mode might change that.

Peter Cordes
moonheart08
2

From my experience, there must be some performance benefit in adding extra jobs, simply because disk I/O is one of the bottlenecks besides the CPU. However, it is not easy to decide on the number of extra jobs, as it is highly interconnected with the number of cores and the type of disk being used.

Matt
1

YES! On my 3950X, I run -j32 and it saves hours of compile time! I can still watch YouTube, browse the web, etc. during the compile without any difference. The processor isn't always pegged, even with a 1TB 970 PRO NVMe or a 1TB Aorus Gen4 NVMe and 64GB of 3200C14 RAM. Even when it is, I don't notice any difference UI-wise. I plan on testing with -j48 in the near future on some big upcoming projects. I expect, as you probably do, to see some impressive improvement. Those still with a quad-core might not get the same gains...

Linus himself just upgraded to a 3970x and you can bet your bottom dollar, he is at least running -j64.

lazyacevw