1

I know a lot of compilers nowadays are very good at optimizing code. However, could a person who fully comprehends modern PC architecture write code that is faster than what the compilers produce? Like, what if they wrote the code 100% in assembly, focusing on the architecture? And if it does make a difference, is it worthwhile?

김도영
  • 31
  • 5
  • 1
Sometimes it is possible. However, it's very hard. The best way to beat the compiler is to improve the program instead of chasing minuscule gains in performance from optimising assembly. – fuz Jul 19 '20 at 15:17
Yes, but understand that a lot of it today is not just the processor; what is outside the processor plays a major role in performance. Detailed documentation for the whole system, including experience on the x86, is not readily available. – old_timer Jul 19 '20 at 15:27
Due to the nature of the PC (x86) world/history, code that performs very well on your machine can/will be slower on another machine. For x86 you want to aim for a good generic average, not code tuned for a particular system or family. – old_timer Jul 19 '20 at 15:29
  • 1
It is not difficult to find places where you can improve compiler output for various reasons. So it doesn't take much work to take the compiler output and make it "better". – old_timer Jul 19 '20 at 15:30
  • Yes, [C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](https://stackoverflow.com/a/40355466) has a section on beating the compiler for that small loop. It takes hours / days of human effort (vs. seconds for a compiler) to improve on, and benchmark to verify that it was a real improvement, and it's very inconvenient to use asm in practice, so it's rarely done. – Peter Cordes Jul 20 '20 at 00:44

3 Answers

9

Yes! An experienced developer can clearly beat a compiler on specific tasks (given a relatively large amount of time).

One reason is that developers can have more information about a given task than the compiler (developers can experiment with algorithms, and they may know the data sizes, the possible inputs, and the execution context of the program). Another reason is that compilers are not perfect (they use heuristics) and often fail to perform high-level code transformations.

However, it is often sufficient to provide hints to the compiler, tune compilation parameters, or insert inline assembly or built-in calls, rather than writing a whole program in assembly.
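As a minimal sketch of "hinting instead of hand-writing assembly" (assuming GCC or Clang, whose extensions these are): C99 `restrict` promises the compiler that pointers don't alias, which enables vectorization, and `__builtin_expect` tells it which branch is the common case. The function names here are made up for illustration.

```c
#include <stddef.h>

/* `restrict` promises dst and src never overlap, so the compiler is
 * free to vectorize this loop without emitting runtime overlap checks. */
void scale(float *restrict dst, const float *restrict src,
           size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}

/* __builtin_expect biases code layout: the rare error path is moved
 * out of the hot fall-through path. */
int checked_div(int a, int b) {
    if (__builtin_expect(b == 0, 0))  /* hint: b == 0 is unlikely */
        return 0;
    return a / b;
}
```

Whether these hints actually change the generated code depends on the compiler and optimization level; inspecting the assembly output (e.g. with `-S`) is the only way to be sure.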

A good example of this is the use of low-level processor instructions such as non-temporal stores, SIMD instructions, and bit-wise instructions. Compilers can often be coaxed into generating these instructions with enough hints. Register allocation, in contrast, is a case where an expert on the target hardware can write better assembly by hand (compiler hints are not sufficient there).
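One concrete instance of such a bit-wise instruction: population count. A portable loop works everywhere, while the GCC/Clang builtin (an assumption here; MSVC spells it differently) compiles to a single `POPCNT` instruction on x86-64 when the target supports it. This sketch shows both:

```c
#include <stdint.h>

/* Portable bit count: clears the lowest set bit each iteration
 * (Kernighan's trick), so it loops once per set bit. */
static int popcount_portable(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear lowest set bit */
        n++;
    }
    return n;
}

/* Compiler builtin: with -mpopcnt (or -march=native on a modern CPU),
 * GCC and Clang emit a single POPCNT instruction. */
static int popcount_builtin(uint64_t x) {
    return __builtin_popcountll(x);
}
```

No assembly was written, yet the builtin version reaches the hardware primitive directly.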

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
4

Sometimes the human can produce better code, if some of these conditions are met:

  1. The human needs specific knowledge about the target architecture.
  2. The human knows all the tricks the compiler uses (e.g. a left shift instead of a multiplication).
  3. Furthermore, the human needs to know a lot about assembly and processors: pipeline stalls, cache misses, ...
  4. The human will need a lot of time for non-trivial programs.
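Point 2 can be illustrated with two classic "human tricks" that compilers already apply automatically: strength reduction (a shift for multiplication by a power of two) and a mask instead of a modulo. At `-O1` and above, GCC and Clang typically emit identical code for each pair, so hand-applying the trick gains nothing. (The function names are invented for this sketch.)

```c
/* Multiplication by 8 vs. an explicit shift: the compiler performs
 * this strength reduction itself. */
unsigned times8_mul(unsigned x)   { return x * 8; }
unsigned times8_shift(unsigned x) { return x << 3; }

/* Unsigned modulo by a power of two vs. an explicit mask: again,
 * the compiler already knows this equivalence. */
unsigned mod16_div(unsigned x)  { return x % 16; }
unsigned mod16_mask(unsigned x) { return x & 15u; }
```

Comparing the compiler's output for both spellings (e.g. with `gcc -O2 -S`) is a quick way to check whether a "trick" is still worth writing by hand.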

> Like, what if he write the code with 100% assembly, focusing on the architecture?

Such a program will be really fast on that CPU, but you would have to rewrite it from scratch for every CPU. (Say you wrote it for processor 1, which has a fast `shr` instruction, but processor 2 has a faster `div` instruction.) Furthermore, the development time will be significantly longer (up to 20x), which means higher costs.

> And if it does make a difference, is it worthwhile?

Only in a small set of applications, like writing for a microcontroller, or if you really need raw performance (processing data that can't be offloaded to the GPU).

For more information: When is assembly faster than C?

BUT: First use better algorithms, for example the Coppersmith–Winograd algorithm instead of the naive algorithm for matrix multiplication. Only once every other possibility is exhausted should you resort to assembly, otherwise you can end up in a maintenance nightmare quite quickly.
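Staying with the matrix-multiplication example: even without changing the asymptotic algorithm, a simple loop reordering in plain C (ikj instead of ijk) turns strided, cache-hostile accesses to B into sequential ones, and typically yields a far bigger speedup than any assembly micro-tuning of the naive loop. This is a toy sketch; the fixed size `N` is an assumption for illustration.

```c
#include <string.h>

#define N 64  /* toy fixed size for illustration */

/* Naive ijk order: the inner loop walks B column-wise, touching a new
 * cache line on almost every iteration. */
void matmul_ijk(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];  /* strided access to B */
            C[i][j] = s;
        }
}

/* ikj order: the inner loop walks both B and C row-wise, which is
 * sequential in memory and therefore cache-friendly. */
void matmul_ikj(const double A[N][N], const double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];  /* sequential access to B and C */
        }
}
```

Both functions compute the same product; only the memory-access pattern differs, which is exactly the kind of high-level change that should come before reaching for assembly.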

JCWasmx86
  • 3,473
  • 2
  • 11
  • 29
  • 2
Maintainability is a key point, IMO. Compilers can inline and do constant-propagation across a whole program, re-optimizing every call-site for a small function after a small change. Doing that by hand for asm would be a nightmare, so for a project written entirely in asm you'd probably have actually called some function (or used a macro) and have missed optimizations in some cases. i.e. compilers can quickly redo optimization to create asm that would be unmaintainable by hand. Hand-written asm is only worth considering for isolated hot loops or blocks, not usually whole projects. – Peter Cordes Jul 20 '20 at 00:48
  • The middle ground here is manual SIMD vectorization with intrinsics (e.g. Intel's https://software.intel.com/sites/landingpage/IntrinsicsGuide/). You can do things that compilers wouldn't create from normal scalar code, using hardware primitive operations. But the compiler still fills in the details like array index calculations. – Peter Cordes Jul 20 '20 at 00:50
1

Yes, humans coding in assembler can beat compilers. But in general, you'd better spend your precious time optimizing at a higher level.

Why can humans beat compilers?

Because compilers have been designed by humans with knowledge of the target architecture. So, with the same degree of knowledge, a human can produce assembly code that is at least as performant as the compilation result.

It can probably be even better, as a human developer can optimize for a given task, while the compiler can only apply generic optimizations.

Why is that a bad idea?

It's all about development cost.

Developing in assembler takes much, much more time than in a higher-level language, and reduces readability and maintainability.

In most situations, you will be better off investing the same amount of development time into high-level optimizations, e.g. better algorithms, local optimizations, all based on thorough profiling of the application to find the real bottlenecks.

With the budget needed for an assembly solution, you can even have two or three independent, competing teams developing their high-level solutions, and later have them combine their best ideas into a final version, and still have budget to further optimize that one.

Ralf Kleberhoff
  • 6,990
  • 1
  • 13
  • 7
  • At some point the "local optimization" may use intrinsics like `__builtin_popcountll` or `_mm_shuffle_epi8`. But usually it stops there, unless the hotspot is *very* important and the compiler does a bad job with your intrinsics, not compiling to the asm you want. (This is apparently still common with ARM SIMD, where compilers do much worse than with SIMD intrinsics for x86 or PowerPC). At that point, it's worth considering asm *for that one loop*. Not for the whole project of course, nobody does that these days, except for personal reasons / fun. (e.g. FASM is written in assembly.) – Peter Cordes Jul 20 '20 at 10:48