Can recasting stuff in assembler make your program faster? Yes. Significantly faster? That depends on where the bottleneck is.
With your modern processor, saving instructions does not necessarily save processing time. Scheduling operations to make best use of overlapping execution may do better, even if more instructions are involved. The rules are complicated, not (in my experience) well documented, and vary from processor to processor... so they are probably better suited to machine generation of instructions than to your human programmer. The processor is built to run machine-generated code! Cleaner, hand-crafted code may look prettier, but may not run any faster :-(
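To illustrate "more instructions, but better overlap", here is a minimal sketch in C rather than assembler, just to keep it short. The function names are made up for the example. The second version does more work per iteration, but its four addition chains are independent of each other, so a processor that overlaps execution may get through them faster. Whether it actually wins depends on the processor and the compiler, so treat it as something to measure, not a rule.

```c
#include <stddef.h>

/* Straightforward sum: every addition depends on the previous one,
   so the additions form a single serial chain. */
double sum_simple(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: more instructions, but the four
   chains do not depend on each other and can overlap in execution. */
double sum_overlapped(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the elements in a different order, so for floating point the results can differ in the last bits. That is one reason a compiler will generally not make this change on its own unless told it may (for example with something like `-ffast-math`), whereas a human who knows the difference does not matter can.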
For small fragments of critical code, a human can be better at making use of special purpose instructions in ways particularly well suited to the special needs of the task. A human can also do better where they can take advantage of special properties of the problem. And in assembler the human may be able to push even general purpose instructions to get more out of them. Working with the branch predictor can help, and the human can know more about what the code is going to do than the compiler can deduce from what's written. Similarly, the human may do better at dropping hints to the cache management for pre-reads etc. In short, the human can (still) do better in specialised areas where general purpose code generation cannot be expected to produce the best result.
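You don't always have to drop to assembler to pass on this kind of knowledge. As a hedged sketch: GCC and Clang provide `__builtin_expect` and `__builtin_prefetch`, which let you tell the compiler which way a branch usually goes and ask for data to be fetched ahead of use. The function, the macro names and the prefetch distance of 8 below are all assumptions made up for the example; the right distance (if prefetching helps at all) has to be found by measurement.

```c
#include <stddef.h>

/* GCC/Clang extensions. On other compilers these would need replacing. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical task: add up table entries selected by an index list,
   skipping negative entries, which the programmer knows to be rare. */
long sum_indexed(const long *table, const size_t *idx, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        /* Ask for the entry needed 8 iterations from now (read access,
           low temporal locality). The distance 8 is a guess, not a rule. */
        if (i + 8 < n)
            __builtin_prefetch(&table[idx[i + 8]], 0, 1);

        long v = table[idx[i]];

        if (UNLIKELY(v < 0))    /* hint: this branch is seldom taken */
            continue;

        total += v;
    }
    return total;
}
```

The branch hint mostly influences how the compiler lays out the code around the branch; the hardware predictor still makes up its own mind at run time. The access pattern here (indices known in advance, table entries scattered) is the sort of thing the human knows and the compiler cannot easily deduce.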
In larger pieces of code, the human may do better by not being bound by the ABI. The human may be able to allocate key information to registers across many functions, and have some functions take their arguments and return results in ways which are convenient for the callers, and which don't require shuffling things around all the time between calls. Also, the human may be better at allocating stuff in memory to help the cache, given a better global view of the problem. In short, the human can (still) do better armed with a wider view of the problem.
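The memory layout point can be sketched without any assembler at all. Suppose (purely as an assumed example, with made-up type and function names) the hot loop only ever needs one field out of each record. Laid out as an array of structures, every cache line fetched is mostly fields the loop never looks at; laid out as a structure of arrays, the field it wants is packed together.

```c
#include <stddef.h>

/* Array of structures: the masses are spread out, one per record,
   so the loop drags x, y, z and flags through the cache as well. */
struct particle {
    double x, y, z;
    double mass;
    int    flags;
};

double total_mass_aos(const struct particle *p, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p[i].mass;
    return total;
}

/* Structure of arrays: the masses are contiguous, so every cache
   line fetched by the loop is full of data it actually uses. */
struct particles {
    double *x, *y, *z;
    double *mass;
    int    *flags;
};

double total_mass_soa(const struct particles *p, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p->mass[i];
    return total;
}
```

Which layout is better depends on the access patterns across the whole program, which is exactly the global view a compiler working one function at a time does not have.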
However, none of this is going to come cheap! It may be necessary to try more than one approach to hand-optimising the code, and to measure carefully to confirm that the result is indeed better.
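For the measurement, even something crude is better than guessing. A minimal sketch, assuming standard C `clock()` is good enough for the job: run the candidate several times and keep the fastest run, which tends to be the one least disturbed by whatever else the machine was doing. The names `best_time` and `workload` are made up for the example, and `workload` merely stands in for the code under test.

```c
#include <time.h>

/* Run fn() `reps` times and return the fastest single run, in seconds
   of processor time. Returns -1.0 if reps is not positive. */
double best_time(void (*fn)(void), int reps)
{
    double best = -1.0;

    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        fn();
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}

/* Stand-in for the code being measured. The volatile sink stops the
   compiler from optimising the loop away. */
static volatile unsigned long sink;

void workload(void)
{
    unsigned long s = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        s += i * i;
    sink = s;
}
```

Calling `best_time(workload, 10)` for each variant gives numbers you can compare. The resolution of `clock()` is coarse on many systems, so the code being timed needs to run long enough to register; for very short fragments a cycle counter or a proper profiler would be the better tool.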
Of course, this is all assuming you are writing for a "big" processor -- which you didn't specify. If you are writing for an itty-bitty PIC (say), older rules apply.
And, of course, the oldest rules of all when it comes to code optimisation:
1. don't do it: find a better algorithm
2. don't do it: find a better data structure
3. don't do it: repeat (1) and (2)
4. don't do it... unless you have a piece of code which is critical to the running time... and even then only optimise the bit(s) that matter.
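As a small illustration of why the algorithm comes first, here is a sketch with two made-up lookup functions. No amount of hand-tuning the first one changes the fact that it inspects up to n elements; the second, given data that is kept sorted, inspects about log2(n) of them.

```c
#include <stddef.h>

/* Linear scan: works on any array, looks at up to n elements. */
int contains_linear(const int *a, size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return 1;
    return 0;
}

/* Binary search: requires a[] sorted in ascending order,
   halves the remaining range [lo, hi) on every step. */
int contains_sorted(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return 1;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

For a handful of elements the linear scan may well be the quicker of the two, which is the other half of the advice: only bother where it matters, and measure.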
Believe me, as an assembler programmer it pains me to say these things! But you need a particular sort of problem to make it worth the time and effort involved in carefully crafting effective assembler code.