When should I use ASM calls?

Question

I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding,genetic algorithms, neural networks, ...) So I've been thinking about how to tackle this situation best so that it would run smoothly.

(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)

Is it worth it to learn how to work with ASM, so I can make ASM calls in C++, can it give me a significant/notable performance advantage?

In what situations should I use it?

Note that writing small sections of inline assembly might not be able to beat the compiler at all, because it will interfere with the compiler's ability to perform various optimisations on the surrounding code. — Mankarse, Dec 13 '11 at 14:43

Oliver Charlesworth · Accepted Answer · 2011-12-13T12:23:20.180

14

Almost never:

You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.

edited Dec 13 '11 at 12:23

answered Dec 13 '11 at 12:11

Oliver Charlesworth

267,707
33
569
680

2

+1 unless you **know** that you need asm, you don't need it :) – Martin Dec 13 '11 at 12:19
@Martin That's what the question is... When do I need asm, in other words, how do I know when I need it... – xcrypt Dec 13 '11 at 12:21
And for my application, I'm pretty sure I will have to use *something*, because as of the current plan, there's no way in hell it would run smoothly with the options I currently have/know about ;) – xcrypt Dec 13 '11 at 12:31
1

You'd need to know about not just the x86 architecture but also that of the graphics system. Since most cards are designed for DirectX and DirectX is designed for them it would make no sense to go underneath that library. Forgive me if you'd prefer another gfx lib. But I know what you mean about finding the processor cycles so follow Luchian's advice and profile for bottlenecks in C++ first. – John Dec 13 '11 at 13:09
@xcrypt: You haven't got that quite right. What Martin says is: You don't know if you need asm, therefore you don't need it. And he is quite right. And Olis answer is to the point too. – Gunther Piez Dec 13 '11 at 15:53
@drhirsch That just seems quite contra-productive for the people actually considering to learn ASM, as well as being slightly offensive. It's not because I don't know now, that I won't at some point later in time. – xcrypt Dec 13 '11 at 17:51
Also Oli's answer has nothing to do with that ridiculous statement, it's an answer that gets to the point. – xcrypt Dec 13 '11 at 18:38
2

@xcrypt: sorry, I never meant to be offensive, it was intended as a humorous way to say that in no programming situation would you need to use assembly for performance reasons. Assembly could be useful if you are writing an OS, device driver, BIOS, firmware, that kind of stuff - in an environment where a `main()` method makes no sense. Even if you were to outperform the compiler (which is **very** hard) it would only be by a tiny margin. Real performance gain comes from optimizing algorithms, i.e. cache results, avoid calculating data that is not needed, etc – Martin Dec 14 '11 at 16:43
@xcrypt: I would still encourage you to learn assembly because it's interesting to see what happens "under the covers", it's just that with modern compilers in mind, performance is not really a reason to get started with asm – Martin Dec 14 '11 at 22:28

score 4 · Answer 2 · answered Dec 13 '11 at 12:12

If your not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.

Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.

score 3 · Answer 3 · edited Jun 20 '20 at 09:12

3

Short answer: it depends, most likely you won't need it.

Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.

Do some profiling.

You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.

My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.

Profile profile profile....

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 13 '11 at 12:15

Luchian Grigore

253,575
64
457
625

So what would you say, are the other benefits of learning ASM? – xcrypt Dec 13 '11 at 12:28
1

@xcrypt of course... learning anything has its benefits... you'd get a better understanding of how things work "under the hood", but in this particular case, it wouldn't be of that much help (other than conceptual understanding of course). – Luchian Grigore Dec 13 '11 at 12:30
And, what exactly do you mean by separating logical sections into modules? – xcrypt Dec 13 '11 at 12:42
1

@xcrypt well, for example, you could put all your geometry (or whatever) in a separate module and wherever you use intensive geometrical computations, use methods from that module. That way, if optimization is required, there's only one place you need to make your changes. – Luchian Grigore Dec 13 '11 at 12:48
1

Occasionally, things are just easier in ASM too. Not the norm by any means, but when manipulating hardware directly, sometimes it's just easier to do (and to read)... – Brian Knoblauch Dec 13 '11 at 13:54

score 3 · Answer 4 · answered Dec 13 '11 at 12:16

A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.

However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.

score 2 · Answer 5 · edited May 23 '17 at 11:55

2

Don't get ahead of yourself.

I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).

This was not done by assuming in advance what needed to be made fast.

It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler. Rather I rely on random pausing, a method known and used to good effect by some programmers.

It proceeds through a series of iterations. In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.

As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest). That's how you get major speedup.

If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.

P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time. What they miss is problems that are diffuse. For example, you could have 100 routines, each taking 1% of time. That is, no bottlenecks. However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all. Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples. In other words, if you took 9 samples, on average you would notice the activity on 3 of them. That tells you it's big. So you can fix it and get your 3/2 speedup ratio.

edited May 23 '17 at 11:55

Community

1
1

answered Dec 13 '11 at 13:57

Mike Dunlavey

40,059
14
91
135

There are profilers (like oprofile), which do exact this type of statistical profiling. Modern hardware does support this too - the CPU_CLK_UNHALTED counter on intel hardware does _exactly_ what you describe, take a snapshot of the stack and the instruction pointer every _n_ cycles and look where it spends most of its time – Gunther Piez Dec 13 '11 at 16:01
@drhirsch: Yeah I know they do. You might want to re-read the last paragraph and/or look at that sourceforge project. If they did exactly the same thing, then you would see similar speedup results. – Mike Dunlavey Dec 13 '11 at 16:29
700x? Sounds pretty hardcore ;) This will def. help me, thanks! – xcrypt Dec 13 '11 at 18:20
@xcrypt: Those are the numbers. Every app is different. If I/O is a necessary part of the process, you can't go any faster than that. But you can't say ahead of time what you'll uncover. Sometimes the I/O *isn't* necessary, like I see an app spending fully 1/2 of its time reading strings from resources in dlls, just so they can be displayed, so as to show the user why it's taking so long! – Mike Dunlavey Dec 13 '11 at 20:16
@xcrypt: What often happens is you can clearly see that you can get a healthy speedup ratio, provided you are willing to replace a cherished data structure A with another one B. Not only that, on a later iteration, you may need to replace B with C. In my experience, this gets a lot of resistance. Thus, speedups don't happen. – Mike Dunlavey Dec 13 '11 at 23:35
@drhirsch: I would love it if somebody would go to that sf project, extract just version cim2 (it's not very big), ignore everything else in there, and optimize it using oprofile or whatever they prefer. See how fast they can make it go, no holds barred, working on it as long as they want, and compare speeds before and after. If they get anywhere near the speed ratios shown in that project, then their skill needs to be taught to others. – Mike Dunlavey Dec 15 '11 at 19:13
@MikeDunlavey: I downloaded it, and in my initial run with cim2 I got 27 µs per job, a hundred time faster than on your box. I am on linux, using gcc, and had to remove some code like the `CWinApp` constructor. I don't know if we can draw some meaningful conclusions. – Gunther Piez Dec 15 '11 at 20:10
@drhirsch: I think it just means 1) your machine is faster (mine is a few years old), 2) you probably had optimizing turned on. There is a manifest constant that sets the number of jobs, so you can bump it up by a factor of 10 or 100 if you need to. It's the time/job that needs to be optimized. – Mike Dunlavey Dec 15 '11 at 21:41
Yes, I already bumped it up by a factor of 10. And of course I do turn on compiler optimization - this topic is about optimization. And reading your changelog, a large part of my speedup is because I accidently turned on another "optimization": I supressed the output - I didn't want 10k lines scrolling up in my terminal and also I didn't want to measure the speed of cout. I didn't need neither oprofile nor random stopping for this. But this doesn't hardly count as optimization in my book, it was more like fixing a bug :-) – Gunther Piez Dec 15 '11 at 22:47
@drhirsch: Yeah. I appreciate your taking a look at it. I tend to leave debug on because if I turn it off I lose visibility into what the program's doing. Turning off cout was the last thing I did, and it only accounted for 6% of the original time, so there's still around 80-90% of time could be eliminated. **Correction** 99%. – Mike Dunlavey Dec 15 '11 at 23:15
I have had a look over it and decided not to try it :-) Mainly because I believe there are many constructs which could be optimized by looking at it, try to understand what it does, and rewriting - long before any profiling tools are being used. Further more any meaningful comparisons seem to be not possible - gcc's std::vector implementation seems to be far more efficient than MSVC's, my first run was 100x faster than on your box, which leaves only a factor 7x to be optimized. – Gunther Piez Dec 23 '11 at 08:19
But I still believe that your method of random pausing, which is essentially a slow version of oprofile's CPU_CLK_UNHALTED events feels only superior because of the lack of an easy-to-use implementation of it (or because of the superior integeration of your debugger). Consider a tool which does twenty times "random pausing", records and counts each stack frame/function for you and presents the results in an readable form to you, but you don't need 20 times stopping, restarting and manually counting, but only one command (or mouse click). Wouldn't this be superior in any case to random pausing? – Gunther Piez Dec 23 '11 at 08:27
I you haven't already seen it, look at http://stackoverflow.com/questions/8416395/possible-optimization-in-my-code , where I looked at the code, found it very strange, did not understand it and so did a complete rewrite with a completely different approach. Resulting "optimization factor" was several thousands. For a 10 lines code project this was done in 15 minutes, but to be honest, for your sourceforge project I don't have the patience :-) – Gunther Piez Dec 23 '11 at 08:49
@drhirsch: I do have a hard time getting people to take the issue seriously. Regardless, I appreciate your being willing to consider it, and if your box and compiler are so good it's not an issue, I understand. I also agree that automatic sampling is obviously much easier, assuming you can control it, so as to avoid diluting with irrelevant samples. Where the road takes a very sharp bend is in the span between the information that the samples contain and how much of it arrives in your brain. The kind of summarization that the tools do is where the most valuable knowledge is just summed out. – Mike Dunlavey Dec 23 '11 at 15:59
@drhirsch: So, it is tempting to say the program is just badly written or on bad hardware, which is just a distraction from the main point. Real performance problems 1) have reasons, and 2) don't necessarily localize in particular places in the code. That's what summarization misses, resulting in big speedup factors being missed. If on my slow machine the code is only 7 times faster than on your fast box with super compiler, that's the point I'm trying to make. – Mike Dunlavey Dec 23 '11 at 16:19
@MikeDunlavey I agree, the presentation of the data is of high importance. But the conclusion I draw is as long as people are more willing to do random pausing, the integration/data presentation of sampling profilers is just not good enough. Or the other way around, as long as oprofile isn't easy enough to use or the profile data isn't easy enough to interpret, random pausing is a valid sampling method. A statistical profiler should be able to do everything you do with random pausing even better (or least as good) – Gunther Piez Dec 23 '11 at 16:28
Even not clearly localized speed problems (like 'a lot seems to happen in std::vector') can be detected by a sampling profiler, it's a question how the data is summarized. You can look at the time spend per line in the code, but you can look at time per function or even library too. – Gunther Piez Dec 23 '11 at 16:35
@drhirsch: Right. A good fraction of my income over 30 years has been solving serious performance problems. People say (effectively) "Oh he's just *magic*, not quite human". Here on SO & in some articles I do my level best to tell how to do the magic, and I hear (effectively) "Oh there's that *kook* again". Anyway, you're right that if profilers only showed you the actual info in samples, they would be much more powerful. (There's still the issue that samples alone are often not enough.) Regardless, you can have the last word, if you want it. – Mike Dunlavey Dec 23 '11 at 21:42

score 1 · Answer 6 · answered Dec 13 '11 at 12:16

"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.

score 1 · Answer 7 · answered Dec 13 '11 at 14:13

1

If you need to ask than you don't need it.

answered Dec 13 '11 at 14:13

Casey

12,070
18
71
107

If you can't spell nor give a productive answer, *then* you might not want to waste your time posting it. – xcrypt Dec 13 '11 at 18:00
I thinks it perfectly valid. If someone isn't experiences enough to know when they might need it, than they shouldn't worry about it. And it looks like others agree too. – Casey Dec 13 '11 at 19:13
1

although he is asking specifically in which situations he would want to use it and not if. So I'll give you that. – Casey Dec 13 '11 at 19:14

When should I use ASM calls?

7 Answers7

Do some profiling.