While we're at it, MinGW also supports 64-bit inline assembly, and it's pretty fast and free. It used to be slow on some math, so I'd start by comparing the performance of MSVC and MinGW to see if it's a decent starting place for your application.
Also, as to hand-coded assembly being slower:
- Actually, humans very often write assembly that runs more efficiently than compiler output. At least, that was the common wisdom when I was learning programming in the '70s and '80s, and it continued to be the case through ~2000.
- You can always code it in C or C++, compile that to assembly, and tweak the result. That way, you can learn from the compiler's optimizations and see if you can improve on them.
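As a minimal sketch of that workflow, here is a small function you could compile to assembly and then study. The function name `sum_squares` and the file name in the comment are made up for illustration. With GCC or MinGW on x86, something like `g++ -O2 -S -masm=intel hot.cpp -o hot.s` should write the generated assembly to `hot.s` in Intel syntax.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hot function, used only for illustration.
// extern "C" keeps the symbol name unmangled, so it is easy to find in hot.s.
extern "C" std::uint64_t sum_squares(const std::uint32_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Widen before multiplying so the product cannot overflow 32 bits.
        total += static_cast<std::uint64_t>(data[i]) * data[i];
    }
    return total;
}
```

Keep the C++ version around as a reference: any hand-tweaked assembly should return the same results for the same inputs before you bother timing it.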
Assembly very much can have a place in code that needs high optimization, no matter what M$ says. You won't really know if assembly will or won't speed up code until you try it. Everything else is just pontificating.
As above, I favor the approach of compiling C++ code into assembly and then hand-optimizing that. It saves you the trouble of writing much of it, and with a little experimentation, you may get something that tests out faster. FWIW, I've never needed to do this with a modern program. Often, other things can speed it up just as much or more, e.g. multi-threading, using look-up tables, moving time-expensive operations out of loops, using static analyzers, using run-time analysis tools such as valgrind (if you're on Linux), etc. However, for performance-critical applications, I see no reason not to try, and to just use it if it works. M$ is just being lazy by dropping inline assembly.
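To make the "try it and measure" point concrete, here is a rough sketch of one of the non-assembly options, a look-up table, next to a small timing helper. All the names (`popcount_loop`, `popcount_lut`, `time_ms`) are invented for this example, and bit counting is just a convenient stand-in for whatever your hot loop does.

```cpp
#include <array>
#include <chrono>
#include <cstdint>

// Baseline: count set bits one at a time.
inline int popcount_loop(std::uint32_t x) {
    int count = 0;
    while (x != 0) {
        count += static_cast<int>(x & 1u);
        x >>= 1;
    }
    return count;
}

// Build a 256-entry table once, reusing the baseline for each byte value.
inline const std::array<std::uint8_t, 256>& popcount_table() {
    static const std::array<std::uint8_t, 256> table = [] {
        std::array<std::uint8_t, 256> t{};
        for (std::uint32_t i = 0; i < 256; ++i) {
            t[i] = static_cast<std::uint8_t>(popcount_loop(i));
        }
        return t;
    }();
    return table;
}

// Look-up version: four table reads instead of up to 32 loop iterations.
inline int popcount_lut(std::uint32_t x) {
    const auto& t = popcount_table();
    return t[x & 0xFFu] + t[(x >> 8) & 0xFFu] + t[(x >> 16) & 0xFFu] + t[x >> 24];
}

// Wall-clock time in milliseconds for `reps` calls of f.
template <typename F>
double time_ms(F f, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

When you time the two versions, accumulate the results into a `volatile` variable (or print them); otherwise the optimizer may remove the work entirely and the numbers will mean nothing. The same helper works for comparing a C++ routine against a hand-tuned assembly version.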
As to whether 64-bit or 32-bit is faster, this is similar to the situation with 16-bit vs. 32-bit. The wider data path can sling huge amounts of data faster. If both run on a 64-bit OS, they run at exactly the same clock speed, so the 32-bit program shouldn't be faster. Yet I've observed the CPU clock on 32-bit Win7 to run slightly faster than on 64-bit Win7. Thus, for the same number of threads and for more CPU-intensive operations, a 32-bit app on 32-bit Win7 would be faster. The difference isn't much, though, and 64-bit instructions can really make a difference. Besides, a given user will only have one OS installed, so on a 64-bit OS the 64-bit app will either be faster, or the 32-bit app will at best match its speed. The 64-bit app will be a larger download, however. You might as well go for the possibly faster speed with 64 bits, unless you are dealing with a dedicated system running code you know won't be moving large amounts of data.
Also, note that I benchmarked a 64-bit and a 32-bit app on OSes of the respective sizes, using the respective versions of MinGW. The app did a lot of 64-bit floating-point number crunching, and I was sure the 64-bit version would have the edge. It didn't! My guess is that floating-point operations in the built-in math coprocessor take the same number of clock cycles on both OSes, and perhaps run slightly slower on 64-bit Win7. My benchmarks were so close in both versions that neither was clearly faster. Perhaps long number-crunching operations were slower on 64-bit while the rest of the 64-bit program code ran a little faster, giving nearly equal results.
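If you want to repeat that kind of comparison yourself, a sketch along these lines could be built once with a 32-bit toolchain and once with a 64-bit one (for example with `-m32` and `-m64`, if your GCC/MinGW install supports both). The names `pointer_bits`, `crunch` and `crunch_ms` are made up here, and the workload is only a stand-in for real number crunching, not the benchmark I actually ran.

```cpp
#include <chrono>
#include <climits>
#include <cmath>
#include <cstddef>

// Reports whether this binary was built as 32-bit or 64-bit.
constexpr int pointer_bits() {
    return static_cast<int>(sizeof(void*) * CHAR_BIT);
}

// Stand-in floating-point workload: sum of sqrt(i) for i = 1..n, in doubles.
double crunch(std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        total += std::sqrt(static_cast<double>(i));
    }
    return total;
}

// Wall-clock time in milliseconds for one crunch(n) call. The sum is written
// to `result` so the optimizer cannot discard the work.
double crunch_ms(std::size_t n, double& result) {
    const auto start = std::chrono::steady_clock::now();
    result = crunch(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Print `pointer_bits()` next to each timing so you know which build produced it, and run each build several times. One thing to keep in mind when reading the results: GCC normally uses SSE2 for floating point in x86-64 builds and the x87 unit in 32-bit x86 builds unless you pass something like `-mfpmath=sse -msse2`, so the two binaries may not be executing the same kind of floating-point instructions.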
Basically, the only time 32-bit makes sense, IMHO, is when you think you might have an in-house app that would run faster on a 32-bit OS, when you want a really small executable, or when you are delivering to users on 32-bit OS machines (many developers still offer both versions) or to a 32-bit embedded system.
Edited to reflect that some of my remarks pertain to my specific experience with Win7 x86 vs. x64.