ASM faster with looping?

Question

Is a code loop in assembler faster than, equal to, or slower than just writing out the needed instructions x times? Or is it code-dependent? And when is the machine faster at executing the binary: if the 16-bit CPU reads its 16 bits straight ahead, or if it jumps 32 bits back?

And finally: what does loop mean in asm?

In assembly everything is code-dependent and machine-dependent. In modern processors (e.g. x86-64), the fastest way to compute something requires finding the best compromise between keeping all cores and all CPU threads busy on every clock, avoiding CPU stalls by using caches optimally (if the data fit in the caches, that is), using vectorized instructions if the task permits, etc. Things were simpler for the 486 and earlier processors, when the only parallel execution was CPU/FPU parallelism. Nowadays, for the fastest code, you need to make sure that every one of the cores does something useful every clock. — nrz, Mar 06 '13 at 00:38
So asm loops can barely be processes if the number of instructions is greater than a loop instruction is able to handle (to jump back)? That is to say, if my file must not be more than 512 bytes, then I can't make the loop call at the last byte. Or if I were to add 32 bits and 32 bits on a 16-bit CPU, the CPU must split this operation into a loop automatically. Then a loop which adds 1 bit 16 times is just as fast as adding 16 bits to a zero/empty register. Another question, then, is which registers a loop instruction is allowed to use when executing the code, and which ones can intersect or not. — k t, Mar 06 '13 at 17:58
First, it's not about processes, it's about processor cores. Second, `loop` is not necessarily the fastest instruction for looping. Third, you should check http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle and Mysticial's excellent answer to it. — nrz, Mar 06 '13 at 18:07

score 6 · Accepted Answer · answered Mar 05 '13 at 23:20

It depends. It might be faster to repeat the instructions a number of times. This technique is commonly known as loop unrolling. A loop that is not unrolled can also turn out to be more efficient, because the code will be smaller and many CPUs can recognize the loop pattern and predict it. It may also be possible to partially unroll a loop. For example, instead of executing 20 instructions straight or doing 20 loop iterations, one can do 5 loop iterations executing 4 instructions in each one.
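To make the 20-iterations versus 5-iterations-of-4 example concrete, here is a minimal sketch in C (chosen over assembly so it stays portable). The function names `sum_rolled` and `sum_unrolled4` are made up for this illustration, and nothing here says which variant is faster on a given CPU; it only shows the two shapes.

```c
#include <stddef.h>

/* Plain loop: one addition and one loop branch per element. */
long sum_rolled(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Partially unrolled by 4: one loop branch per four additions.
   For n = 20 this runs 5 iterations of 4 additions each. */
long sum_unrolled4(const int *data, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        total += data[i];
    return total;
}
```

Both functions return the same sum for any `n`; the second trades larger code for fewer branches, which is exactly the trade-off described above.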

Generally, it is hard to tell which is best without knowing what architecture you are targeting (i.e. the make and model of the CPU). That's why people don't write much assembly code: analyzing the pros and cons of different approaches, weighing the cost of execution, and generating different code for different CPU makes and models is something that compiler developers do. Others then write code in their language of choice, and the compiler generates assembly tuned for the target platform, which works out in the vast majority of cases.

To answer your question, you would probably have to write both versions yourself and profile them to see which one wins. Alternatively, you can write the code in C, turn on optimizations for your platform (e.g. the `-O3` and `-march` switches), and see what the compiler generates; it usually makes a sensible choice.
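As a rough sketch of the "write both versions and profile them" approach, the C code below times two variants with the standard `clock()` function. The names `sum_fn`, `sum_loop`, `sum_unrolled4` and `time_variant` are hypothetical, and this is a crude measurement, not a rigorous benchmark: an optimizing compiler may unroll, vectorize or hoist the work in either variant, so it is worth inspecting the generated assembly as well.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the variants being compared. */
typedef long (*sum_fn)(const int *data, size_t n);

/* Variant A: plain loop. */
long sum_loop(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Variant B: unrolled by 4, with a cleanup loop for the remainder. */
long sum_unrolled4(const int *data, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; i++)
        total += data[i];
    return total;
}

/* Calls fn `repeats` times and returns the CPU time used, in seconds.
   The last result is written to *result; the volatile sink is there so
   the calls are less likely to be optimized away. */
double time_variant(sum_fn fn, const int *data, size_t n,
                    int repeats, long *result)
{
    volatile long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink = fn(data, n);
    clock_t stop = clock();
    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

One would call `time_variant` once per variant on the same array, check that both results agree, and compare the returned times over several runs. With GCC or Clang on x86, building with something like `-O3 -march=native` matches the switches mentioned above.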

Hope it helps. Good Luck!

Agreed, it strongly depends. Unrolling loops is beneficial only if the unrolled code doesn't blow the sizes of the various levels of instruction caching the CPU does. That'll be so at least if the penalty for cache misses is higher than that for branches / branch prediction misses. It's often beneficial to partially unroll, as you've said - so that e.g. one iteration processes at least one cacheline's worth of data, or one "full row" of vector registers. Partially-unrolled loops also often provide more chances of "latency hiding" (i.e. intermix load/store ops with ALU ops) than tight loops. — FrankH., Mar 06 '13 at 14:07

score 1 · Answer 2 · edited May 23 '17 at 12:21

A loop in asm usually means a branch-if-equal instruction or something similar. If you're using an HLE and are not working on a platform that does the compare and branch in one instruction, it might be a pseudo-instruction for something equivalent to an x86 `cmpl` followed by a `je`.
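The compare-then-branch shape can be mimicked in C with `goto`, which may help to see what a loop boils down to at the instruction level. This is only an illustrative sketch: the function name `sum_to` and the labels are invented here, and a real compiler is free to emit different instructions.

```c
/* Sums 1..n using only a compare, a conditional branch and a jump back,
   which is roughly the shape an assembly loop takes. */
unsigned sum_to(unsigned n)
{
    unsigned total = 0;
    unsigned i = 1;

loop_top:
    if (i > n)            /* compare ... */
        goto loop_done;   /* ... and branch out of the loop if it holds */
    total += i;
    i++;
    goto loop_top;        /* unconditional jump back to the start */

loop_done:
    return total;
}
```

For example, `sum_to(4)` walks through the body four times and returns 10; each pass costs one compare, one conditional branch and one jump in addition to the useful work.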

MIPS branch instructions: http://en.wikibooks.org/wiki/MIPS_Assembly/Control_Flow_Instructions#Branch_Instructions

Also check out the questions:
