Is it possible to make decisions in assembly without using `jump` and `goto` at all?

Question

In this question, some answers showed how it is possible to make decisions without using "if" statements, however I suspect this is possible because "if" is not the only statement that generates jump instructions.

Given a fixed compiled language (in example C++), can the generated assembly do some kind of decisions without using jump and goto instructions?

Please give an example alternative to a simple if/else statement that does not use such instructions in case of affirmative answer.

'conditional mov' instructions? Can be done with mov instructions: search "MOVfuscator" — ABuckau, Sep 02 '16 at 10:10
Likely because the down voter finds your question to be useless in the big picture. — David Hoelzer, Sep 02 '16 at 12:06
there are certainly things you can do to reduce the conditionals but you cant have a general purpose process operate without them. Even ARMs conditional execution is still a conditional, if this flag combination then go ahead and execute this instruction. it is still an if then else. You dont need 16 different conditional branches, you can actually get away with only having one, but you need at least one for a general purpose processor, in particular for C++ as you suggest. — old_timer, Sep 02 '16 at 13:52
This is not an SO worthy question, if you want to understand something like this take an application like gzip for example and in any pseudo code you want or even re-write it in C/C++ with no conditions, one long linear execution that performs 100% of the functionality and options of the normal gzip. If you can do that then we can make a processor that does that too. If there is a single program that works with todays processors that you cannot re-write without conditions, then there is your answer. — old_timer, Sep 02 '16 at 13:55
SO is for: I have this program it doesnt work, here is my code I think it should do this but it does that can you help. — old_timer, Sep 02 '16 at 13:56
The question is not useless in the big picture. Avoiding jumps, esp conditional jumps is an excellent way of speeding up code, quite relevant. — Johan, Sep 02 '16 at 14:01
@dwelch From the FAQ: `if your question generally covers… a practical, answerable problem that is unique to software development` than it's OK. Looks on topic to me. Also doesn't fall into the no-no's listed further down: http://stackoverflow.com/help/on-topic — Johan, Sep 02 '16 at 14:06
I dont see this landing in that category, it definitely falls into the primarily opinion based though. — old_timer, Sep 02 '16 at 14:49
@dwelch: It's kind of a mix of computer science (theory of computation) and actual branchless techniques in asm. I thought it was interesting, and came up with an idea to implement conditional branching by storing to the Interrupt Descriptor Table and conditionally dividing by 0. No sign of a `jmp` or `jcc` instruction :D. So it's a practical implementable (on x86) solution, as well as some CS theory discussion about the implications if you don't allow hacks like this. — Peter Cordes, Sep 02 '16 at 14:58
A pretty good practical example showing the avoidance of branches and conditions (not completely, but just trying to minimise) is the strategy of loop-unrolling that most modern compilers will do once there is a known constant number of iterations. — tofro, Sep 03 '16 at 16:35

rjp · Answer 1 · 2016-09-02T16:38:05.747

6

The ARM architecture has an interesting conditional execution trait. Running in full ARM mode, nearly every instruction can have a condition attached to it. These are the same conditions that are used on the Branch instructions. An instruction such as add r0, r0, #15 will always execute, but an instruction such as addeq r0, r0, #15 will only execute if the zero flag is set. An equivalent using branches would be:

    beq afteradd       ; skip add if equal
    add r0, r0, #15    ; add 15 to R0
afteradd:

When running on a core that uses Thumb-2, the conditional execution is more limited, but still possible by using the IT instruction. This instruction will create an "if-then" construct without using branches. The IT construct is actually part of ARM's Unified Assembly Language, and should be used whether you're writing for ARM or Thumb-2. Here's the UAL version of the conditional add from above.

it eq                 ; if equal
addeq r0, r0, #15     ; then add 15 to r0

The IT construct can include more than just a single instruction. Using a series of T and E in the instruction, you can add more conditional instructions. In the below example, we will add 15 to R0 if the zero flag is set, otherwise we will subtract 15 from R0. The ITE means, fairly literally, if-then-else. The first following instruction should have a condition that matches your ITE condition, and then the second instruction will become the "else" and should have a condition that is the opposite of the ITE condition.

ite eq                ; if equal
addeq r0, r0, #15     ; then add 15 to r0
subne r0, r0, #15     ; else subtract 15 from r0

I guess this kind of does use if, which may go against what you were asking. In terms of the execution, this implements the if without using a jump, though.

edited Sep 02 '16 at 16:38

answered Sep 02 '16 at 13:48

rjp

1,760
13
15

ARM has `B/BL/BX/...` branches. Jump instructions in my book. `B` even has the exact same semantics as `JMP`. – Johan Sep 02 '16 at 13:51
1

It does, but if someone is talking about an instruction called `jump`, it would not be a good assumption that they are referring to ARM. I really just wanted to write this answer because I feel like the ARM conditional instructions are a really interesting concept. – rjp Sep 02 '16 at 13:52
1

Yes, that's an excellent point, but you degrade an otherwise excellent answer by implying that ARM does support the concept of a `jump`. A rose by any other name and all. I'd remove that part... Oh and Thumb v1 does not support `IT`. By that same token x86 also has no `jump` or `goto` instruction. Nowhere in the Intel documents can I find an instruction who's Mnemonic is JUMP or GOTO. – Johan Sep 02 '16 at 13:56
1

Predicated execution is nice because it lets you conditionally store, but you can still achieve that effect with cmov and a dummy location to dump stores that you don't want. It's more efficient, but I think no more powerful than what you can do with x86-style branchless computation. See my answer that takes a more "what could such a machine compute" approach (theory of computation, not efficiency). I decided it was time to finish writing more after writing about conditionally-branching to the interrupt handler by conditionally dividing by 0. :P (or other stuff with Self-modifying-code...) – Peter Cordes Sep 02 '16 at 14:36
@Johan I removed the Thumb reference (keeping Thumb2). I never used Thumb mode when I was working on ARM9, so I probably misread my quick reference. – rjp Sep 02 '16 at 16:39
@Peter your answer is great. I've only done x86 on an 8086 and a 486, and while `cmov` probably existed on those, I never had a need for it. When I'm at a desktop, I'll give it a thorough read through. – rjp Sep 02 '16 at 16:41
cmov was new with P6 (Pentium Pro / Pentium II) – Peter Cordes Sep 02 '16 at 17:01

score 5 · Answer 2 · edited May 23 '17 at 12:22

See @Marco van de Voort's answer for a list of simple "normal" ways to write branchless code (cmov/other predicated execution, cmp/setcc, copy flags to a GP register, etc). That forms a building block for making branchless versions of things which would be much better done with branches.

When you say "make decisions", it sounds like you're really asking if such a machine would be fully programmable.

Part of the Theory of Computation in computer science is figuring out how powerful different models of computation are. This isn't about how efficient a model is, it's about what can be computed with it at all. I suspect that a register machine without conditional branch instructions would not be equivalent in power to a Turing machine, aka Universal Computer. (Even if you allowed non-indirect unconditional branches).

Update: x86's mov instruction is Turing-complete even without self-modifying code, just using normal indexed addressing modes. You also need a loop around the sequence of loads and stores. TODO: rewrite the parts of this answer invalidated by that. This increases the value of unconditional branches without self-modifying-code (e.g. letting IP wrap around in real mode, or other weird tricks for doing this without a jmp, see below).

I think the mov-only Turing Machine is sort of simulating another machine in the registers and memory of x86, so instead of the x86 [R/E]IP branching, a value in a general-purpose register behaves like a program counter that's branching. (Or a Turing machine that's traversing a tape)

You can make very stripped down architectures which are still Univeral Computers, but what I gather from what I've seen so far is that any model of computation similar to a register machine (like x86, ARM, etc.) needs some version of a conditional branch to be a Universal Computer.

See also this SO question: Minimal instruction set to solve any problem with a computer program, especially the Bit-manipulating machine part of Wikipedia's One Instruction Set Computer article; it's least like a conditional branch, but still does the same job.

I don't know enough theory of computation to put accurate bounds or a formal description on what you can compute without conditional branches (on a register machine like x86 or ARM), but I can say something:

I can say that given enough space for instructions, you could sort an array, but maybe only if you generated the right sequence of instructions for that array length. Using branchless techniques like cmov, you can conditionally swap two memory locations based on a comparison. A good choice of algorithm would be a sorting network.

You can fully unroll any loop of known length, and maybe handle variable-length loops by unrolling the maximum number of iterations you will ever need. You can use branchless techniques to load/store from dummy addresses when your loop counter is past the end of the array. (e.g. use cmov to set your pointer to point to a small static array that only exists for you to store garbage into without affecting anything else. Your code will always do the same fixed number of stores, but with different inputs different ones of those stores can go into the dumping ground.) ARM predicated execution of any instruction means you can do conditional stores without needing a static dumping ground.

Fully unrolling Merge Sort should work well, because it always does the same amount of work for a given array size. Quick Sort, on the other hand, would be very inconvenient because the amount of work varies with the data and choice of partition. Moreover, it has an O(n^2) worst case, so you'd need at least that much code.

Since you have to fully unroll anything that you want to run, it means you need to know how long it will run. This means you can't implement a full Turing Machine, because not all Turing Machines do halt. (Not to mention that it's impossible for an algorithm to exist that decides whether any given Turing Machine halts).

Self-modifying code might increase the efficiency of this, but I don't think it could increase the computational power if you're still not allowed to rewrite your own machine code to include branch instructions. (We have to exclude SMC from generating any kinds of branches, not just conditional-branch instructions, because we could conditionally produce an unconditional-branch instruction.)

I think SMC in a fixed loop is probably more powerful than non-self-modifying code in a loop, even if the SMC isn't allowed to generate jump instructions. See below for some crazy ideas for looping without branch instructions (e.g. timer interrupt and IP wraparound).

With unconditional jumps only (no conditional jumps), you can make loops to handle arbitrary-size versions of the same problems as without jumps. But then your program can't terminate. If there was some kind of "terminate program" instruction, putting it inside the loop would make it a non-loop.

So what you could end up with is a program that eventually has an array sorted (or whatever its job was), and maybe it signals this fact by storing a "true" somewhere, and keeps looping without affecting the data (using branchless conditional code). You can sort of think of this as having a "transport-triggered" exit function, so it can be used conditionally.

More ways to emulate an unconditional jump so you can create a loop, I think increasing the power of self-modifying code. (Even if you refrain from using that power to conditionally branch).

function-call instructions can easily be abused them to implement plain jumps, so we'd obviously want to exclude them.

On x86, you could use call to jump to code that just pops the return address of the stack instead of ever returning. (So you could implement a loop). To build a conditional branch out of x86's indirect call instruction, you'd use branchless techniques (like cmov) to get either the instruction after call or a different destination into a register, and run call rax. Or more simply, with eax set to 0 or 1 (e.g. from setcc), you could run call [target_table + rax*8] to select between two destinations for the branch.

Similarly, ret is easy to abuse as an indirect jump: push an address and run ret. (Don't do this in real life: CPUs are optimized for correct call/ret nesting, and mismatching them causes the return-address predictor stack to mis-predict on future RETs.)

Interrupts cause the CPU to jump to the interrupt handler. Software interrupts (like x86's int 0x80 instruction, or ARM svc / swi) synchronously jump to the interrupt handler, so they're effectively just a function all (and a mode switch into kernel mode).

So that means int is another jump instruction we need to exclude. But we can also use the timer interrupt to loop. The interrupt handler is the top of the loop. Do whatever amount of work you want and then re-enable interrupts. Execute NOPs after that; the next timer interrupt will eventually fire. Or on a machine that leaves interrupts enabled while servicing interrupts: limit your loop body to an amount of work that can always fit between timer interrupts, even with worst-case cache behaviour.

Another way for software to generate an interrupt is by running an illegal instruction. Self-modifying code can conditionally branch to the illegal-instruction interrupt handler by conditionally generating an illegal instruction. Or conditionally branch to the division-error handler by conditionally dividing by 0. Since you can set the address of that interrupt handler (e.g. by storing into Interrupt Vector/Descriptor Table on x86), you can conditionally branch to an arbitrary address using a store and a divide by 0. (All of your branch targets need to do whatever is necessary to re-enable that interrupt as being triggerable, though).

Let the program counter wrap around: On a flat-memory von-Neumann architecture machine, this requires self-modifying code to keep track of the program state, because there's no memory that doesn't execute as code.

In segmented x86 (e.g. 16-bit Real Mode) IP will wrap around from 0xFFFF to 0x0000. So your loop is the whole code segment, and you have separate data storage in other segments. You can only modify CS with a far jmp or far call/far ret (or interrupt / interrupt-return), so you can't do anything like pop cs to jump (because pop cs doesn't exist, unlike the other pop-segment-register instructions).

Fun fact: The last instruction at the end of the code segment doesn't have to end at 0xFFFF. e.g. a multi-byte instruction could start at 0xFFFF and end at 0x04. It's possible to write x86 machine code that executes different ways if decoded from different start points, but this is not useful here. (This imposes huge constraints on what you can do).

ARM makes the program counter accessible as a normal general-purpose register, so you can do control-flow with sub r15, #1024 or something, not technically using a b (branch) or any kind of call (bl) or return instruction. You're still directly modifying the program counter, though, so it's really just a branch instruction. (pc is another name for r15).

Not only is x86 turing-complete without conditional branches, it is complete if you [eliminate every instruction other than mov](https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf). The construction doesn't even use SMC, but it does need a single unconditional branch back to the start of the program to support non-termination (but all terminating programs, which are the interesting ones, can be written without it). There is even a [compiler](https://github.com/xoreaxeaxeax/movfuscator) for it. — BeeOnRope, Sep 02 '16 at 15:07
oops, I've even seen that paper before, but I totally forgot about it. I need to sleep; maybe I'll make a more serious edit when I get up. — Peter Cordes, Sep 02 '16 at 15:13
I believe an explicit theoretical approach is only useful when strictly formal. Conditional branches are just a courtesy, basically just the programming languages have them. Every other model (from PRF to TM, passing for Godel T) don't have them and they are all complete. There's nothing special about them and even a register machine doesn't need them in principle. Also the part about unrolling isn't quite right: You seems to mix the notion of problem (i.e. language) with that of implementation (i.e. algorithm). The unrolling may be endless but the emulation would be correct anyway (endless) — Margaret Bloom, Sep 02 '16 at 15:45
@MargaretBloom: I think I got mixed up by looking at minimal instruction-set machines, like one-instruction or dec/jnz-with-built-in-jumps machines. re: infinite unroll: I was still thinking of a real x86 or ARM CPU with finite address space. At this point this post is more a jumble of related ideas rather than one coherent answer, unfortunately. — Peter Cordes, Sep 02 '16 at 15:57

Marco van de Voort · Answer 3 · 2016-09-02T12:12:45.317

4

Assuming you mean x86,

conditional moves (cmov) as said,
the older set instructions (set al according to a flag)
pushf/popf solutions
SSE2 (part of the default set on 64-bit x86_64) has pcmpeqb and similar instructions to get the result of comparisons into registers.
FPU instructions like FCOM+LAHF (from Ruslan, see below)
use a lookup table to map input values to output values (from Ped7g below)

all allow to get the results of decisions (cmp*) in registers, where they can be further manipulated by shifts and results.

This is mainly useful to create a function return value without branches or do certain forms for stream processing (e.g. SSE2 based binarization of an image). It is not really a general purpose substitute for branching.

edited Sep 02 '16 at 12:12

answered Sep 02 '16 at 10:47

Marco van de Voort

25,628
5
56
89

1

5. FPU instructions like `FCOM`+`LAHF`. – Ruslan Sep 02 '16 at 10:56
Thanks, added it as 5th option – Marco van de Voort Sep 02 '16 at 11:19
2

Other general-purpose branchless techniques, like ANDing a boolean to mask an operand to zero is effectively the same thing as cmov (or achieves the same result a different way). e.g. instead of ADD then CMOV, mask one of the inputs to AND so you either add zero or the value you were going to add. In SIMD programming, this is often better than blending (cmov). (Zero is the additive identity, and also the identity value for many other operations, like XOR, OR, and SUB. The multiplicative identity is 1, so you need to conditionally convert a multiplicand to 1 to produce a conditional multiply) – Peter Cordes Sep 02 '16 at 11:34
2

Also you may cheat it by look-up-table in some cases, for example to convert 0-15 value in register `al` into hexa `0-9A-F` char you can do `mov al,0..15` `mov ebx,hexlut` `xlatb` ... with `hexlut: db '0123456789ABCDEF'`. – Ped7g Sep 02 '16 at 12:07
1

@Ped7g: If you want ways to cheat, how about looping by letting `IP` wrap around. :D See my answer. – Peter Cordes Sep 02 '16 at 14:41

Is it possible to make decisions in assembly without using `jump` and `goto` at all?

3 Answers3

Linked