For integers, the notion of data type covers both width and signedness.
A data type belongs to a variable and can be signed or unsigned. Variables hold values, and integer values can be negative or positive.
There are 4 standard widths in assembly: byte, word, long, and quad, each twice as large as the prior. These names do not by themselves indicate whether a value is signed or unsigned. In the x86 world, word is 16 bits, whereas in most other environments (MIPS/RISC V) word refers to 32 bits. Further, long is sometimes called dword, for double word, and quad is sometimes called qword, for 8-byte values.
There are also 4 standard widths in C: char, short, int, and long. In C they are generally understood as signed, except that char has implementation-defined signedness. C guarantees that sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), with sizeof(char) always 1, but to know exactly which is what you must consult the implementation's documentation; an implementation is supposed to tell you. Many implementations have int and long both as 32 bits, but sometimes there are compiler options to change that, and long long is usually 64 bits.
In C, we can add the keyword signed or unsigned to ensure the data type is one that can hold negative values or cannot hold negative values, respectively.
For comparison operations across both signed and unsigned data types, there are a total of 10 usual relations. Let's note now that instruction sets omit relational operations where one operand is signed and the other is unsigned (and vice versa), and programming languages effectively do too: C, for example, accepts such a comparison but first converts both operands to a common type, often an unsigned one. If you have such a situation, the best approach is to promote both operands to the next higher signed size and do the comparison that way. So, the standard 10 relations include unsigned-to-unsigned and signed-to-signed comparisons (but no signed-to-unsigned and no unsigned-to-signed).
Two of them, equal (==, eq) and not equal (!=, ne), apply the same way to both signed and unsigned data types: to be equal, the bit patterns must be identical, and signedness doesn't matter there (given that both operands are signed or both are unsigned).
For the rest, we must know the signedness of the data type to interpret results properly. A negative number, if accidentally viewed as unsigned, looks like a large positive number. So, if we use the wrong comparison operator, then -1 will appear as the maximum unsigned value and be larger than 1. That's why we must know the data type. In machine code, we can infer whether the data type is signed or unsigned from the comparison operator used.
The industry has generally settled on terminology:
- above & below for unsigned > and unsigned <
- above or same & below or same for unsigned >= and unsigned <= (68000)
- above or equal & below or equal for unsigned >= and unsigned <= (Intel)
- less than unsigned (ltu) for unsigned < (MIPS/RISC V)
- less than & greater than for signed < and signed >
- less than or equal & greater than or equal for signed <= and signed >=
Let's also add that C (and other high level languages) use logical variable declarations to tag variables with data types, and with that the compiler generated machine code accesses the same variable's physical storage consistently as that data type, whenever the program uses variables.
In machine code, by contrast, there are no variable declarations that the processor sees or knows about, so some data type information must be conveyed with every instruction that manipulates storage. To copy data the processor only needs to know the size, not the signedness, and the same holds for comparison by equal/not-equal. For other operations (inequalities like <, <=, or overflow detection) the processor must be informed of both the data type's size and its signedness.
There are at least two reasons that processors don't read variable declarations. One is that it would be too much for them to remember, or to put it another way, we have another way of remembering: incorporating that information into the machine code of the program, which means that the program really knows and tells the processor at every instruction.
The other is that the physical storage of the processor (CPU registers and memory) is frequently being repurposed. The CPU registers are permanent, but logical variables of high level languages can be ephemeral, especially parameters and local variables. Logical variables have scope, and when the scope exits, the variable disappears, leaving the physical storage free to be reused for another purpose, which the assembly program does by simply initializing that physical storage with a new value. Thus, one moment a register may hold an unsigned byte and the next moment a signed integer. The machine code program's job is to keep that straight, and the compiler does it in part through the type declarations in the source code it is translating.
Conditional branching is somewhat complex, as follows. In C, we might do something like
if ( a < b ) goto Label;
This relatively simple operation essentially has 4 operands, more than most processors accommodate in one instruction. The 4 operands are variable 1, variable 2, the specific relational operator, and the goto-target label.
So, one approach used by instruction set designers is to split the 4 operands across 2 separate instructions, like compare followed by branch (or set). The compare operation takes variable 1 and variable 2 and in effect does all 10 comparison operations simultaneously, recording enough information in the flags register to answer any of the 10. The branch instructions take the specific relational operator and the goto-target label: they interpret the flags according to the relational operator to decide whether to branch.
The setxx instructions parallel the branch instructions, but have a register target (a boolean) rather than a goto/branch target.