64bit Applications and Inline Assembly

Question

I am using Visual C++ 2010 to develop 32bit Windows applications. There is something for which I really want to use inline assembly, but I just realized that Visual C++ does not support inline assembly in 64bit applications. So porting to 64bit in the future is a big issue.

I have no idea how 64bit applications differ from 32bit applications. Is there a chance that 32bit applications will ALL have to be upgraded to 64bit in the future? I heard that 64bit CPUs have more registers. Since performance is not a concern for my applications, using these extra registers is not a concern to me. Are there any other reasons that a 32bit application needs to be upgraded to 64bit? Would a 64bit application process things differently from a 32bit application, apart from possibly using registers or instructions that are unique to 64bit CPUs?

My application needs to interact with other OS components, e.g. drivers, which I know must be 64bit in 64bit Windows. Would my 32bit application be compatible with them?

`Since performance is not a concern for my applications`, why on Earth are you using inline assembly then? — Hans Passant, May 29 '11 at 14:06
@Hans Passant: the other common reason is programs that create code dynamically, e.g. virtual machine runtimes/JIT compilers. They can benefit from being able to rewrite code, which is a bit harder if you don't know which code was there before. But people writing such VMs would probably not need to ask this question - the statement "compilers write better assembly than humans" doesn't really apply to humans who write compilers ;) — MSalters, May 30 '11 at 08:48
Also see [Making assembly function inline in x64 Visual Studio](https://stackoverflow.com/q/41208105/608639). If you can get the Visual Studio compiler/linker to inline the freestanding ASM, then it is not a big loss in practice. And maybe [How to do a naked function and inline assembler in x64 Visual C++](https://stackoverflow.com/q/26637755/608639). — jww, Apr 25 '19 at 02:22
@MSalters *the statement "compilers write better assembly than humans" doesn't really apply to humans who write compilers* It actually can apply, at least sometimes. For comparison, humans who write artificial intelligence for playing chess may not be able to beat their own AI in playing chess. — WhatsUp, Mar 30 '22 at 20:03

score 15 · Accepted Answer · edited Apr 25 '19 at 02:33


Visual C++ does not support inline assembly for x64 (or ARM) processors, because generally using inline assembly is a bad idea.

1. Usually compilers produce better assembly than humans.
2. Even if you can produce better assembly than the compiler, using inline assembly generally defeats code optimizers of any type. Sure, your bit of hand optimized code might be faster, but the fact that code around it can't be optimized will generally lead to a slower overall program.
3. Compiler intrinsics are available from pretty much every major compiler that let you access advanced CPU features (e.g. SSE) in a manner that's consistent with the C and C++ languages, and does not defeat the optimizer.
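
As a rough illustration of the intrinsics point, here is a minimal sketch that adds two float arrays with SSE intrinsics instead of inline assembly. The helper name `add_arrays` and the `HAVE_SSE` macro are made up for this example; the intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`, which MSVC, GCC and Clang all ship. A scalar fallback is included for targets without SSE.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics header
#define HAVE_SSE 1      // hypothetical macro, local to this sketch
#else
#define HAVE_SSE 0
#endif

// Adds two float arrays element-wise; n must be a multiple of 4.
// add_arrays is a hypothetical helper name, not something from the answer.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
#if HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions at once
    }
#else
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];  // portable fallback
#endif
}
```

Because this is ordinary C++ as far as the compiler is concerned, the same source builds for both 32bit and 64bit targets, and the optimizer can still inline the function and schedule the code around it.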

> I am wondering would there be a chance that 32bit applications will ALL have to be upgraded to 64bit in the future.

That depends on your target audience. If you're targeting servers, then yes, it's reasonable to allow users to not install the WOW64 subsystem because it's a server -- you know it'll probably not be running too much 32 bit code. I believe Windows Server 2008 R2 already allows this as an option if you install it as a "server core" instance.

> Since performance is not a concern for my application, using the extra 64bit registers is not a concern to me. Are there any other reasons that a 32bit application has to be upgraded to 64bit in the future?

64 bit has nothing to do with registers. It has to do with size of addressable virtual memory.

> Would a 64bit application process things differently from a 32bit application, apart from using some registers/instructions that are unique to 64bit CPUs?

Most likely. 32 bit applications are constrained in that they can't map more than ~2GB into memory at once. 64 bit applications don't have that problem. Even if they're not using more than 4GB of physical memory, being able to address more than 4GB of virtual memory is helpful for mapping files on disk into memory and similar.
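
The arithmetic behind this can be sketched in a few lines. The function names below are made up for illustration, and the numbers are theoretical upper bounds only: as noted above, the practical per-process limit on 32 bit Windows is lower than the full 4 GiB.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Number of bytes addressable with a pointer of the given width in bits.
// Only valid for widths below 64, so that the result fits in a uint64_t.
constexpr std::uint64_t addressable_bytes(unsigned pointer_bits) {
    return std::uint64_t{1} << pointer_bits;
}

// Pointer width of the current build: 32 for an x86 build, 64 for an x64 build.
constexpr unsigned pointer_bits() {
    return static_cast<unsigned>(sizeof(void*) * CHAR_BIT);
}

// True if a mapping of `bytes` bytes can be expressed at all with pointers of
// the given width. This ignores OS limits and fragmentation, which only make
// things tighter in practice.
constexpr bool fits_in_address_space(std::uint64_t bytes, unsigned bits) {
    return bits >= 64 || bytes <= addressable_bytes(bits);
}

static_assert(addressable_bytes(32) == 4294967296ULL, "32 bits address 4 GiB");
```

For example, `fits_in_address_space(8ULL << 30, 32)` is false: an 8 GiB file cannot be mapped in one piece by a 32 bit process no matter how much RAM the machine has, while a 64 bit process has no such problem.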

> My application needs to interact with other OS components, e.g. drivers, which I know must be 64bit in 64bit Windows. Would my 32bit application be compatible with them?

That depends entirely on how you're communicating with those drivers. If it's through something like a "named file interface" then your app could stay as 32 bit. If you try to do something like shared memory (Yikes! Shared memory accessible from user mode with a driver?!?) then you're going to have to build your app as 64 bit.
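
One thing worth watching even with a file-style interface is the layout of any buffers you hand to the driver: a struct containing raw pointers has a different size in a 32 bit build than in a 64 bit build. The sketch below is an assumption-laden illustration, not any real driver's ABI; `DriverRequest`, its fields and `make_request` are all hypothetical. It only shows the general idea of using fixed-width fields so that both builds agree on the layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical request layout, for illustration only.
// Fixed-width fields keep the layout identical in 32-bit and 64-bit builds.
struct DriverRequest {
    std::uint32_t command;      // hypothetical operation code
    std::uint32_t length;       // number of valid bytes in payload
    std::uint64_t user_cookie;  // always 64 bits wide; never a raw pointer
    std::uint8_t  payload[16];
};

static_assert(sizeof(DriverRequest) == 32, "layout must not depend on bitness");

// Builds a request, copying at most sizeof(payload) bytes of data.
DriverRequest make_request(std::uint32_t command, const void* data, std::uint32_t len) {
    DriverRequest req{};  // zero-initialise every field
    assert(len <= sizeof(req.payload));
    req.command = command;
    req.length = len;
    std::memcpy(req.payload, data, len);
    return req;
}
```

On Windows such a buffer would typically be passed to the driver through a call like `DeviceIoControl`; whether a given driver accepts requests from 32 bit processes is up to that driver.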

edited Apr 25 '19 at 02:33 by jww · answered May 29 '11 at 07:46 by Billy ONeal

3

One correction - 32bit apps can map ~4GB - as the 32bits declare :) It depends on the OS whether some additional limits come up - on 32bit Windows by default you get 2GB but can get 3GB using a boot switch. I think on most 64bit systems the apps can get the full 4GB. Also, a 32bit OS can address more than 4GB of physical memory: http://msdn.microsoft.com/en-us/library/aa366778(v=vs.85).aspx#physical_memory_limits_windows_server_2003 - WS 2003 x86 == 64 GB – RnR May 29 '11 at 10:35
17

I disagree with you saying 64 bit has nothing to do with registers. X64 has twice the registers as X86. – Boofhead May 29 '11 at 11:11
@RnR: Even though 3GB of address space is possible, an application won't be able to map all that at once. A mapping needs to be contiguous -- if nothing else the program's code itself would be limiting in those cases. Even with the `/3GB` switch I doubt you could get much more than 2GB mapped. And I'm not talking about physical memory at all, so the PAE argument doesn't really apply here -- PAE doesn't help virtual memory, only physical RAM. On a 64 bit system, apps can get a hell of a lot more than 4GB -- they've got terabytes of address space to play with! – Billy ONeal May 29 '11 at 22:53
3

@Boofhead: Yes, x64 does have more registers. But the extra registers are not the reason people are moving to 64 bit. People are moving to x64 for address space reasons. The registers are just a bonus. – Billy ONeal May 29 '11 at 22:54
I see such apps using more than 2GB quite often - sometimes you don't need 10000000x more memory and that additional 1GB will be more than enough - anyway - just wanted to clarify this part of the answer "for future use" as it's overall a very good one :) – RnR Jun 06 '11 at 12:32
1

Not sure how it works with Windows, but on MacOS X / iOS you get a major advantage if _all_ applications have the same bitness. The reasons for that most likely apply to Windows as well. So being the only 32 bit app among lots of 64 bit apps hurts the whole system (basically because it forces all 32 bit libraries into memory in addition to all 64 bit libraries). I suppose the day where 32 bit is not allowed is 5-10 years away. The day where customers hate you or think you are incompetent for being 32 bit is much nearer. – gnasher729 Mar 22 '14 at 13:14
70

I'm really sorry for bumping this, but... whoever (at Microsoft) decided to not include inline assembly in x64 because it's "generally a bad idea" should be shot on sight. I'm the programmer, let me face the consequences of the bad code you (Microsoft) suppose I'm going to write. – rev Nov 04 '14 at 20:30
2

@Acid: Complain to the people who own the compiler, not me. Although I happen to agree with the compiler folks here. If you want to use assembly, then just write an assembly file. I'm happy that they're not spending development time implementing a frontend feature that 99.999% of people should not be using when a perfectly fine workaround exists. – Billy ONeal Nov 04 '14 at 20:33
9

They will definitely lose a customer when I switch to x64. It's my app, and for what I do, I *need* inline assembly (I mainly do reverse engineering). So one more GCC customer, one less MSVC customer. Just my 5 cents. – rev Nov 04 '14 at 21:09
3

@AcidShout: Good thing MSFT makes all that money from that compiler they give away for free :) – Billy ONeal Nov 04 '14 at 21:10
13

there are other reasons besides speed for inline assembly – Brennan Vincent Nov 30 '16 at 18:32
3

*"... because generally using inline assembly is a bad idea"* - citation, please. The only downside Microsoft lists at [Inline Assembler](https://github.com/MicrosoftDocs/cpp-docs/blob/master/docs/assembler/inline/inline-assembler.md) is lack of portability. In fact, Microsoft lists several benefits of inline assembly at [Advantages of Inline Assembly](https://github.com/MicrosoftDocs/cpp-docs/blob/master/docs/assembler/inline/advantages-of-inline-assembly.md). – jww Apr 25 '19 at 02:27
1

@rev: The real reason MSVC dropped support for inline asm is that MSVC's implementation was clunky, brittle, and unsafe for functions with register args!!! comment: [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](//stackoverflow.com/posts/comments/59576185). MS gives the ridiculous explanation (https://docs.microsoft.com/en-us/cpp/assembler/inline/using-and-preserving-registers-in-inline-assembly?view=vs-2019) that "a function has no way to tell which parameter is in which register", except obviously it has to know where to find them to emit its own code for C statements. – Peter Cordes Apr 25 '19 at 17:08
And yes, if you want to use inline asm, MSVC's entire syntax / design was always inefficient, making it impossible to get data into an `_asm` statement without going through memory. GNU C inline asm syntax is much more complicated, but (if you understand the design) you know exactly what you're getting and can defeat fewer optimizations. (Still obviously destroys constant-propagation and value-range info, so https://gcc.gnu.org/wiki/DontUseInlineAsm unless you have to, especially if the only motivation is performance. Usually better to use intrinsics and/or tweak the C++ source.) – Peter Cordes Apr 25 '19 at 17:12
"Usually compilers produce better assembly than humans." Compilers are terrible at generating assembly. That is why critical parts are optimized in assembly by hand. And God knows how much CO2 is emitted daily because of this. – Siavoshkc Apr 14 '21 at 21:41

score 12 · Answer 2 · edited Jul 13 '20 at 20:59


Apart from @Billy's great write-up, if you really feel the need to use 64bit assembly, then you can use an external assembler like MASM to get that done, see this. (It's also possible to speed this up with prebuild scripts.)
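
A minimal sketch of what the external-assembler route can look like. Everything named here is hypothetical: the file `add_u64.asm`, the function `add_u64` and the `USE_MASM_IMPL` macro are invented for the example. The assembly is shown in a comment and assumes the Windows x64 calling convention (first two integer arguments in RCX and RDX, result in RAX). The C++ side declares the function `extern "C"` and also carries a plain C++ fallback with the same signature, so the file still builds and can be tested without the object file.

```cpp
#include <cassert>
#include <cstdint>

// Contents of a hypothetical add_u64.asm, assembled with ml64.exe (x64 MASM):
//
//   .code
//   add_u64 PROC
//       mov rax, rcx    ; first integer argument (Windows x64 convention)
//       add rax, rdx    ; second integer argument
//       ret             ; result is returned in RAX
//   add_u64 ENDP
//   END

#if defined(USE_MASM_IMPL)  // hypothetical macro: define it when linking add_u64.obj
extern "C" std::uint64_t add_u64(std::uint64_t a, std::uint64_t b);
#else
// Plain C++ fallback with the same signature, for testing and portability.
extern "C" std::uint64_t add_u64(std::uint64_t a, std::uint64_t b) {
    return a + b;
}
#endif
```

Callers just write `add_u64(2, 3)` either way. The obvious cost compared with inline assembly is that a function living in a separate object file is normally not inlined into its callers.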

edited Jul 13 '20 at 20:59 by Joseph Sible-Reinstate Monica · answered May 29 '11 at 09:43 by Necrolis

score 7 · Answer 3 · answered Jan 17 '15 at 22:19


The Intel C Compiler 15 has inline assembly capability in 64bit too. You could integrate the Intel compiler into Visual Studio as a toolset: then you'd have VC++ 64bit with inline assembly. One catch though: it's expensive. Cheers.

answered Jan 17 '15 at 22:19 by Silvio

score 3 · Answer 4 · edited Nov 03 '20 at 22:59

While we're at it, MinGW also has 64-bit inline assembly language; and it's pretty fast, and free. It used to be slow on some math; so I'd start out comparing performances of MSVC vs. MinGW to see if it's a decent starting place for your application.
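
For reference, a minimal sketch of what 64-bit inline assembly looks like with a GNU-style compiler such as MinGW-w64 (GCC or Clang). The function name `add_with_asm` is made up for this example; on any other compiler or CPU the code falls back to plain C++.

```cpp
#include <cassert>
#include <cstdint>

// Adds b to a. Uses GNU extended asm on x86-64 GCC/Clang (including MinGW-w64)
// and plain C++ everywhere else. add_with_asm is a hypothetical name.
std::uint64_t add_with_asm(std::uint64_t a, std::uint64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("addq %1, %0"  // AT&T syntax: a += b
            : "+r"(a)      // operand 0: read-write, any general-purpose register
            : "r"(b)       // operand 1: input, any general-purpose register
            : "cc");       // the flags register is clobbered
    return a;
#else
    return a + b;
#endif
}
```

The constraint strings tell the compiler exactly which values go in and out, so it can keep them in registers; the compiler still cannot see through the asm statement to optimize the addition itself.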

Also, as to hand-coded assembly being slower:

Actually, humans very often do code assembly that runs more efficiently than compilers - or at least that was always the common wisdom when I was learning programming in the 70's and 80's and continued to be the case through ~2000.
You can always code it in "C" or C++, compile that to assembly, and tweak it to see if you can improve that. That way, you can learn from optimizations; and see if you can improve on them.

Assembly very much can have a place in code that needs high optimization, no matter what M$ says. You won't really know if assembly will or won't speed up code until you try it. Everything else is just pontificating.

As above, I favor the approach of compiling C++ code into assembly, and then hand-optimizing that. It saves you the trouble of writing much of it; and with a little experimentation, you may get something that tests out faster. FWIW, I've never needed to with a modern program. Often, other things can speed it up just as much or more, e.g. multi-threading, using look-up tables, moving time-expensive operations out of loops, using static analyzers, using real-time analyzers such as valgrind (if you're on Linux), etc. However, for performance-critical applications, I see no reason not to try; and just use it if it works. M$ is just being lazy by dropping inline assembly.

As to whether 64-bit or 32-bit is faster, this is similar to the situation with 16-bit vs. 32-bit. The wider bandwidth can sling huge amounts of data faster. If both run on a 64-bit OS, they run at exactly the same clock speed; so the 32-bit program shouldn't be faster. Yet, I've observed the CPU clock on 32-bit Win7 to run slightly faster than 64-bit Win7. Thus for the same number of threads, and for more CPU intensive operations, a 32-bit app on 32-bit Win7 would be faster. However, the difference isn't much; and 64-bit instructions can really make a difference. However, a given user will only have one OS installed; and so the 64-bit app will be either faster for that OS; or at best the same speed if running a 32-bit app on a 64-bit OS. It will be a larger download, however. You might as well go for the possibly faster speed with 64-bits; unless you are dealing with a dedicated system running code you know won't be moving large amounts of data.

Also, note that I benchmarked a 64-bit and a 32-bit app on OSs of the respective sizes, using the respective versions of MinGW. It did a lot of 64-bit floating point number crunching, and I was sure the 64-bit version would have the edge. It didn't!! My guess is that the floating point registers in the built-in math coprocessor run in equal numbers of clock cycles on both OSs, and perhaps slightly slower on 64-bit Win7. My benchmarks were so close in both versions, that one was not clearly faster. Perhaps long number-crunching operations were slower on 64-bit, but the 64-bit program code ran a little faster - causing nearly equal results.

Basically, the only times 32-bit makes sense, IMHO, are when you think you might have an in-house app that would run faster on a 32-bit OS, when you want a really small executable, when you are delivering to users on 32-bit OS machines (many developers still offer both versions), or when you are targeting a 32-bit embedded system.

Edited to reflect that some of my remarks pertain to my specific experience with Win7 x86 vs. x64.

edited Nov 03 '20 at 22:59 · answered Apr 26 '19 at 16:38 by CodeLurker

1

Compilers are *much* better than they were in the 80s. Constant-propagation after inlining often allows simplifications that inline asm would defeat. Also, modern superscalar out-of-order CPUs are better compiler targets (especially x86-64 with its 16 registers vs. 8 in 32-bit mode is a nice improvement), and the things that slow CPUs down have become more obscure. But compilers are still *far* from perfect. – Peter Cordes Apr 26 '19 at 18:59
1

It's very easy to write code that's slower than a compiler ([C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](//stackoverflow.com/q/40354978)), but yes start with compiler output and benchmarking your changes normally avoids that danger, at least for the microarchitectures you test on. If you're familiar with Agner Fog's microarch PDF (https://agner.org/optimize) and instruction tables for a range of modern CPUs, then sure have a go at beating the compiler if you really want to. – Peter Cordes Apr 26 '19 at 19:02
Hand-written code tuned for one uarch might not be perfect for a future uarch, and in theory a compiler 10 years down the road with `-march=native` on some future CPU could do better. So make sure you maintain a decent C version, for testing and portability as well as to test against compiler-generated asm on future CPUs. – Peter Cordes Apr 26 '19 at 19:03
I can't say that I disagree with any of that; but I am pointing out that defeating optimizations can be a much smaller issue than improving the performance of CPU Intensive code, depending on the situation. After all, "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" does indicate some ways humans can think of to optimize assembly that the compiler didn't; e.g. in "Beating the Compiler". I guess we are saying inline assembly can be done with some effort; but make sure you're really beating it in the larger context, and keep around a C++ version. – CodeLurker Apr 29 '19 at 10:58
Yes, exactly. And to point out how much extra knowledge is needed beyond what's needed to write asm that's merely correct, most people are unlikely to beat the compiler. But sure, if you do have compiler-developer levels of asm tuning knowledge, then sure it could be worth the extra development and maintenance time to use inline asm if you can't hand-hold your current C compiler into emitting good asm. And you can benchmark on all the uarches you care about to make sure you didn't create a problem on one you're not testing on. – Peter Cordes Apr 29 '19 at 17:20
2

*the CPU clock on 32-bit OSs runs faster than on 64-bit ones.* I've never heard this claim before for either Intel or AMD CPUs, or seen any evidence of it. Max turbo clocks are not limited by running in long mode (full 64-bit mode or compat 32-bit user-space under 64-bit kernel) instead of legacy (pure 32-bit) mode. Neither Agner Fog's optimization guide or microarch guide (https://agner.org/optimize/) nor Intel's own optimization manual says anything about any kind of effect like that. – Peter Cordes Nov 30 '19 at 09:40
If you have a Skylake or later, modern OSes hand off control of clock speed (P-states) to the HW. Otherwise different OSes might possibly be slower to ramp to the highest P state (which lets the HW go to max turbo if power and thermal budget allows). But that would be a configuration detail, not a fundamental difference. Make sure to let the CPU "warm up" to max clock speed before benchmarking. I'm not going to believe an extraordinary claim like that without a lot more details of how you did the benchmark and what data you're basing the conclusion on; maybe ask a new SO question. – Peter Cordes Nov 30 '19 at 09:42
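
The warm-up advice in the comment above can be sketched roughly as follows. The workload and the function names are made up for illustration, and this is only a starting point: it does not pin threads, control frequency scaling, or report variance.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical workload: sum of i*i for i in [0, n), with unsigned wrap-around.
std::uint64_t workload(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) sum += i * i;
    return sum;
}

// Runs the workload `warmup` times untimed, then `runs` times timed, and
// returns the smallest wall-clock time in nanoseconds (-1 if runs <= 0).
std::int64_t best_time_ns(std::uint64_t n, int warmup, int runs) {
    volatile std::uint64_t sink = 0;  // every result is stored, so calls are not discarded
    for (int i = 0; i < warmup; ++i) sink = workload(n);
    std::int64_t best = -1;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        sink = workload(n);
        auto t1 = std::chrono::steady_clock::now();
        std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (best < 0 || ns < best) best = ns;
    }
    (void)sink;
    return best;
}
```

Taking the best of several timed runs after an untimed warm-up reduces the influence of clock ramp-up and one-off interruptions, which matters when two builds differ by only a few percent.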
The only time 32-bit code makes sense is when you have pointer-heavy data structures (so smaller pointers would save cache footprint / memory bandwidth and increase spatial locality), and you can't use an ILP32 ABI in 64-bit mode. (e.g. [Linux x32](https://en.wikipedia.org/wiki/X32_ABI); `gcc -mx32` is more like `-m64` than `-m32` in registers available and calling convention, only being like 32-bit code in terms of type width.) – Peter Cordes Nov 30 '19 at 09:45
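
To put rough numbers on the cache-footprint point in the comment above, here is a small sketch comparing a node that links with native pointers against one that links with 32-bit indices. Using indices is a swapped-in technique for illustration, not something the comment proposes; the struct names and the 64-byte cache line are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Links are native pointers: 4 bytes each in 32-bit code, 8 bytes in 64-bit code.
struct PtrNode {
    PtrNode* left;
    PtrNode* right;
    std::int32_t value;
};

// Links are 32-bit indices into an array of nodes, regardless of bitness.
struct IndexNode {
    std::uint32_t left;
    std::uint32_t right;
    std::int32_t value;
};

static_assert(sizeof(IndexNode) == 12, "three 4-byte fields, no padding");

// How many whole nodes fit in one cache line (64 bytes is an assumed line size).
constexpr std::size_t nodes_per_cache_line(std::size_t node_size,
                                           std::size_t line_size = 64) {
    return line_size / node_size;
}
```

In a typical 64-bit build `sizeof(PtrNode)` is 24 (two 8-byte pointers, a 4-byte value and 4 bytes of padding), so only 2 whole nodes fit in an assumed 64-byte line, versus 5 of the 12-byte nodes.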
_the CPU clock on 32-bit OSs runs faster than on 64-bit ones. I've never heard this claim before for either Intel or AMD CPUs, or seen any evidence of it_ I say this because I had a dual boot situation on win7 . The x64 partition ran at clock speeds a few tenths of a MHz lower than the x86 one. That was either in System Properties or hwinfo. I think it's because x64 code generates more heat. There might have been an auto BIOS setting; but I had set my own multiplier; and FSB speed was 100 MHz. You'd think by definition, they should come out the same, but they didn't. That was my experience. – CodeLurker Oct 26 '20 at 13:05
The actual core clock speed is only adjustable in steps of 100MHz or so, derived from a relatively stable quartz clock (with a tiny temp coefficient, and the external clock crystal probably changes temp very little). It's plausible the *average* clock speed over a whole computation could be lower with one build than another because of heat. But anything that claims to be measuring max clock speed is just seeing **measurement error** or different calibration of its timing mechanism between builds, if the difference is only fractions of a MHz. Likely not a *real* difference in clock speed. – Peter Cordes Oct 26 '20 at 13:06
I'm not convinced it's generally true that a 64-bit OS even *would* make more heat per unit time, i.e. would run hotter. Possible, depending on the software of course. Maybe even plausible if we're talking about 64-bit Windows, especially if some 32-bit processes are running. (instead of Linux or something where 32-bit is kind of a 2nd-class citizen because it's obsolete.) – Peter Cordes Oct 26 '20 at 13:10
