6

An x86 CPU has some instructions that deal with integers and floating-point numbers.

For example: the INC instruction increments an integer (which can be stored in memory or in a register) by 1, so the INC instruction "knows" that it should interpret the bits that it is manipulating as an integer. So can we say that an x86 CPU has data types (in the same way we can say that C++ has data types)? Or, in order for us to be able to say that, would an x86 CPU have to provide other features like type safety (which it doesn't provide)?

m0skit0
user8426277
    No. `INC` increments the contents of a memory location or register. It has no concept (or concern) about the type of those contents, because they're all numeric to the CPU, whether that content is a pointer, the start of a string, a character, or an integer. Compilers provide type safety. See [Raymond Chen's blog](https://blogs.msdn.microsoft.com/oldnewthing/20190121-00/?p=100745) for a series of articles that may help clarify things. – Ken White Feb 07 '19 at 13:13
  • 1
    @KenWhite: I beg to disagree. Having types != having a type-safe type system. x86 provides several data types on which it can operate natively (byte, word, dword, qword on x64, tword for `mov`s to the FPU, plus the various floating point types). `inc` can operate on these integer data types, specified by the encoding of the destination operand (`byte ptr`, `word ptr`, `dword ptr`, `qword ptr`) or implicit in the choice of registers. – Matteo Italia Feb 07 '19 at 13:28
  • @MatteoItalia: Yes, but the INC instruction only knows that it's operating on `byte ptr`, `word ptr`, etc.; it has no idea what that `word ptr` is pointing to, only that it is a `word ptr`. It's a numeric value (a `word` sized number) that is "supposed to be" a `ptr` to something - that's all the CPU knows. – Ken White Feb 07 '19 at 13:34
  • 1
    @KenWhite: I think we all know that `inc` just feeds the operand to a binary adder. But I agree that using the "knows" to express this seems inaccurate. Hopefully just clumsy wording on the OP's part. I said the same thing in my answer. – Peter Cordes Feb 07 '19 at 13:54
  • @KenWhite Wouldn't `inc word ptr [bx]` increment the value pointed to by `bx`, not `bx` itself? – puppydrum64 Dec 20 '22 at 14:11

2 Answers

13

Yes, asm has operations that work with data in different formats, and you could call those types. But there is zero type safety; that is a good way to express it.

so the INC instruction "knows" that it should interpret the bits that it is manipulating as an integer.

But that's a clumsy way to express this. INC doesn't "know" anything; it just feeds the operand to a binary adder in an ALU. It's completely up to the programmer (or compiler) to use the right instructions in the right order on the right bytes to get the desired result, e.g. to implement high-level variables with types.
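For example, here is a minimal NASM-style sketch (the register choices are arbitrary) of the same 32 bits being treated as an integer by one instruction and as an IEEE float by the next; nothing but the choice of instruction gives those bits a "type":

```asm
mov     eax, 0x3F800000   ; as an integer: 1065353216; as a float: 1.0
inc     eax               ; integer add: EAX = 0x3F800001
movd    xmm0, eax         ; copy the same bit pattern into an XMM register
addss   xmm0, xmm0        ; float add: treats those bits as roughly 1.0000001
```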

Every asm instruction does what it says on the tin, no more, no less. The Operation section in the instruction-set reference manual entry documents the full effect it has on the architectural state of the machine, including FLAGS and possible exceptions, e.g. inc. Or look at a more complicated instruction with more interesting pseudocode that shows where each bit gets deposited, like BMI2 pdep r32a, r32b, r/m32 (and diagrams). Intel's PDF that these are extracted from has an intro section that explains any notation, like CF ← Bit(BitBase, BitOffset) for bts (bit test-and-set).


Everything is just bytes (including pointers, floats, integers, strings, and even code in a von Neumann architecture like x86). (Or on machines with some things that aren't a multiple of 1 byte, everything is just bits.)

Nothing will magically scale indices by a type width for you. (Although AVX512 does use scaled disp8 in addressing modes, so an 8-bit displacement can encode up to -128..+127 times the vector width, instead of only that many bytes. In source-level assembly, you still write byte offsets, and it's up to the assembler to use a more compact machine-code encoding when possible.)
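As a hedged illustration of that AVX-512 point (assuming an AVX-512-capable assembler and CPU), the displacement you write is still in bytes; the assembler only compresses the encoding when it divides evenly by the memory-operand size:

```asm
vmovdqu64 zmm0, [rdi + 64]    ; a 64-byte offset: EVEX disp8*N lets the
                              ; assembler encode this as disp8 = 1
vmovdqu64 zmm1, [rdi + 130]   ; not a multiple of 64, so this one needs a
                              ; full 4-byte displacement
```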

If you want to use inc al on the low byte of a pointer to cycle through the first 256 bytes of an (aligned) array, that's totally fine. (And efficient on CPUs other than P6-family where you'll get a partial-register stall when reading the full register.)
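A minimal sketch of that (the loop body is made up for illustration, and it assumes RAX starts out 256-byte aligned):

```asm
    ; zero the 256 bytes starting at RAX by bumping only the pointer's low byte
fill256:
    mov     byte [rax], 0   ; touch the current element
    inc     al              ; increment just the low 8 bits of the pointer
    jnz     fill256         ; 255 -> 0 wraps (ZF set) after all 256 bytes
```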


It's true to some degree that x86 has native support for many types. Most integer instructions come in byte, word, dword and qword operand size. And of course there are FP instructions (float / double / long double), and even the mostly-obsolete BCD stuff.

If you care about signed vs. unsigned overflow, you look at OF or CF respectively. (So signed vs. unsigned integer is a matter of which flags you look at after the fact for most instructions, because add / sub are the same binary operation for unsigned and 2's complement).
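A tiny sketch of that flags point (register choices are arbitrary): the add itself is typeless; the flag you inspect afterwards is what picks the signed or unsigned interpretation.

```asm
mov     al, 0x7F      ; as signed: +127;  as unsigned: 127
add     al, 1         ; AL = 0x80; the add itself doesn't care which you meant
seto    bl            ; BL = 1: the *signed* interpretation overflowed (127 + 1)
setc    cl            ; CL = 0: the *unsigned* interpretation didn't wrap (127 + 1 = 128)
```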

But widening multiply, and divide, do come in signed and unsigned versions. One-operand imul vs. mul (and BMI2 mulx) do signed or unsigned N x N => 2N-bit multiplication. (But often you don't need the high-half result and can simply use the more efficient imul r32, r/m32 (or other operand size). The low half of a multiply is the same binary operation for a signed or unsigned interpretation of the inputs; only the high half differs depending on whether the MSB of the inputs has a positive or negative place-value.)
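For instance (a sketch with arbitrary register choices), the one-operand forms make the signed/unsigned difference visible only in the high half:

```asm
mov     al, 0xFF    ; as unsigned: 255;  as signed: -1
mov     bl, 2
mul     bl          ; unsigned widening multiply: AX = 0x01FE (255 * 2 = 510)
mov     al, 0xFF
imul    bl          ; signed widening multiply:   AX = 0xFFFE (-1 * 2 = -2)
                    ; the low byte is 0xFE either way; only AH differs
```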


It's not always a good idea to use the same operand size as the C++ data type you're implementing. e.g. 8-bit and 16-bit can often be calculated with 32-bit operand-size, avoiding any partial-register issues. For add/sub, carry only propagates from LSB to MSB, so you can do 32-bit operations and only use the low 8 bits of the result. (Unless you need to right-shift or something.) And of course 8-bit operand size for cmp can be handy, but that doesn't write any 8-bit registers.
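As a sketch of that (the pointer registers are just placeholders), an 8-bit `a + b` implemented with 32-bit operand-size:

```asm
movzx   eax, byte [rsi]   ; load a, zero-extended into a full register
movzx   ecx, byte [rdi]   ; load b
add     eax, ecx          ; 32-bit add; carry only propagates upward, so
                          ; AL already holds the correct 8-bit sum
mov     [rdx], al         ; store just the low byte
```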


x86 data types/formats include much more than just integers

  • signed 2's complement and unsigned binary integer
  • IEEE float and double, with SSE and SSE2, and x87 memory operands.
  • half-precision 16-bit float (vcvtph2ps and the reverse): load/store only. Some Intel CPUs have half-precision mul/add support in the GPU, but the x86 IA cores can only convert (to save memory bandwidth) and must use at least float for vector FP math instructions.
  • 80-bit extended precision with x87
  • 80-bit BCD with x87 fbstp
  • packed and unpacked BCD, supported by the AF flag (nibble-carry) and instructions like DAA (packed-BCD decimal adjust AL after addition) and AAA (ASCII adjust after addition: for unpacked BCD in AL, AH). Not available in 64-bit mode.
  • bitmaps with bt/bts/etc: bts [rdi], eax can select a bit outside the dword at rdi. Unlike with a register destination, the bit-index is not masked with &0x1f (https://www.felixcloutier.com/x86/bt). (This is why bt/bts/etc mem,reg is so many uops, while reg,reg and mem,immediate are not bad.) See the sketch just below this list.
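A minimal sketch of that bitmap use (the bit index is arbitrary): with a memory destination and a register bit-index, bts can reach far past the addressed dword.

```asm
mov     eax, 1000     ; set bit 1000 of the bit array at RDI
bts     [rdi], eax    ; actually sets bit 0 of the byte at RDI+125
                      ; (1000 = 125*8 + 0); CF = old value of that bit
```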

See also How to read the Intel Opcode notation for a list of all notation used in Intel's instruction-set reference manual. e.g. r/m8 is an 8-bit integer register or memory location. imm8 is an 8-bit immediate. (Typically sign-extended to the operand-size if that's larger than 8.)

The manual uses m32fp for x87 FP memory operands, vs. m32int for x87 fild / fistp (integer load/store), and other integer-source x87 instructions like fiadd.

Also stuff like m16:64, a far pointer in memory (segment:offset), e.g. as an operand for an indirect far jmp or far call. It would certainly be reasonable to count far pointers as a "type" that x86 supports. There are instructions like lgs rdi, [rsi] that loads gs:rdi from the 2+8 byte operand pointed to by rsi. (More usually used in 16-bit code, of course.)

m128 / xmm might not be what you'd really call a "data type" though; no SIMD instructions actually treat the operand as a 128-bit or 512-bit integer. 64-bit elements are the largest for anything except shuffles. (Or pure bitwise operations, but that's really 128 separate AND operations in parallel, no interaction between neighbouring bits at all.)

ecm
Peter Cordes
  • 1
    Note that there are still byte-oriented machines with a register width that is not a multiple of 8 like MSP430X systems. You might want to avoid saying “everything is just bytes.” – fuz Feb 07 '19 at 15:17
  • @fuz: fair enough, everything is just bits :P – Peter Cordes Feb 07 '19 at 15:23
  • 2
    I think a more correct term would be "format". Types, as understood in mathematics and CS, is a label attached to data in order to restrict the operations permitted on the data. So I don't think you can have types without type safety. But, yes, that's a thin line and debatable. – Margaret Bloom Feb 07 '19 at 16:17
  • @MargaretBloom Perhaps “interpretation” is an appropriate word. – fuz Feb 07 '19 at 17:41
0

It is just bits, nothing more. The bits being operated on by inc could be a signed integer, an unsigned integer, or a pointer to something (an address). It could even be a floating-point number that some clever (or the opposite of clever) code is using to round up the mantissa.

Some instructions, like multiply and divide, do have a notion of sign when they operate on different-sized quantities of bits (two 8-bit operands coming in, a 16-bit result going out): on a two's complement machine an unsigned multiply and a signed multiply are different, and only because they need to sign-extend one of the operands in order to complete that operation. If you do n bits in and n bits out, you don't even care about sign; it is still just bits. Divide is similar.
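A sketch of that "n bits in, n bits out" point (register and constant choices are just for illustration): with a non-widening multiply, the signed or unsigned interpretation of the inputs doesn't change the result bits.

```asm
mov     eax, -3           ; 0xFFFFFFFD: -3 signed, or 4294967293 unsigned
imul    eax, eax, 5       ; low 32 bits of the product: 0xFFFFFFF1
                          ; read as signed that's -15; read as unsigned it's
                          ; 4294967281, i.e. (4294967293 * 5) mod 2^32
```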

One could say that floating-point operations imply that the bits represent that format, and that is fair.

But the notion of an unsigned int vs. a char * vs. a float, etc., lies mostly in the programmer's brain and in the high-level language; processors are very, very, very dumb. They take the bits they are fed, instructions and data, and operate on them; it is ultimately the job of the programmer to make sure those bits are instructions, the data is data, and the whole thing performs the desired task. The processor is just a bit-manipulation machine; the definition of what each instruction does is written down so that you know what bits you will get out based on the bits you feed it.

Trying to make assembly language or machine code have types is mostly a waste of time. Some syntaxes have things like mov word ptr and such, but that is the nature of the instruction set and, more importantly, of the assembly language; other syntax could have been used (and later was used) to get the right machine code generated without using the word pointer or ptr simply to state that this is an indirect addressing mode.

Trying to understand assembly or machine code in the context of a high-level language doesn't really work; you have to try to think the other way around. These are just bits, and most languages have types to describe those bits so the code works. Some languages go so far as to require the same 8-bit value to be converted from a boolean to an integer or to an (ASCII) character, just to make the language work.

The simplest one to understand is inc or add. If you take two integers in your high-level language, or an integer and an immediate, and do an operation that makes sense, hello = hello + 1;, can you tell the difference, with respect to that instruction, from char *x; ... x++;? You still get some register or memory reference and an immediate in an add. The processor doesn't know nor care that one is a variable/integer and the other is an address; it is just operands and output.
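A hedged sketch of that last point (the stack offsets and operand sizes are invented for illustration): a compiler might emit essentially the same shape of instruction for hello = hello + 1; and for x++;, and the CPU sees only an add of an immediate to some bits.

```asm
add     dword [rsp + 4], 1    ; hello = hello + 1;   (a 32-bit int in memory)
add     qword [rsp + 8], 1    ; x++;                 (a 64-bit char* in memory)
```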

halfer
old_timer