
Just some silly musings, but if computers were able to efficiently calculate 256-bit arithmetic, say if they had a 256-bit architecture, I reckon we'd be able to do away with floating point. I also wonder if there'd be any reason to progress past a 256-bit architecture. My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:

You could have a 256-bit type that used 127 or 128 bits for the integer part, 127 or 128 bits for the fractional part, and of course a sign bit. If you had hardware that was capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
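
Roughly the layout I have in mind (just a sketch for illustration; the struct name and limb layout are mine, not an established format):

```
#include <cstdint>

// A signed Q127.128 fixed-point value: 1 sign bit, 127 integer bits and
// 128 fractional bits, stored as four 64-bit limbs (least significant first).
// The binary point sits between limb[1] and limb[2].
struct Fixed256 {
    uint64_t limb[4];   // limb[0..1] = fraction, limb[2..3] = integer + sign
};
```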

One example: if you were working with lengths and represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distance with a precision smaller than a Planck length!
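
A quick sanity check of those bounds (just a sketch; the Planck length and observable-universe figures are rough constants I've plugged in):

```
#include <cmath>
#include <cstdio>

int main() {
    double smallest = std::ldexp(1.0, -128);   // 2^-128 m: the lowest fraction bit
    double largest  = std::ldexp(1.0,  127);   // 2^127 m: the top integer bit
    double planck   = 1.616e-35;               // Planck length in meters (approx.)
    double universe = 8.8e26;                  // observable-universe diameter in meters (approx.)

    std::printf("2^-128 = %.3e m vs Planck length %.3e m\n", smallest, planck);
    std::printf("2^127  = %.3e m vs universe diameter %.3e m\n", largest, universe);
}
```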

OK, that's only one example, but I'm struggling to think of any situation that could possibly warrant bigger or smaller numbers than that. Any thoughts? Are there possible problems with fixed-point arithmetic that I haven't considered? Are there issues with creating a 256-bit architecture?

Peter Cordes
Iron Attorney
  • good point :D 512 bit? – Iron Attorney May 09 '16 at 15:42
  • Waste of memory bandwidth; almost everything you see now can be represented in 64 bits. And we are already able to calculate 256-bit numbers using AVX2. – ardhitama May 09 '16 at 15:42
  • A valid point about the memory. So you wouldn't want to represent everything with 256 bits; you'd still want to be picky if you were to use your resources efficiently. I know we can currently do 256-bit maths, but I'm guessing it takes a lot more cycles to calculate than if the hardware were designed for it. Or does it? I'm not an expert on hardware yet – Iron Attorney May 09 '16 at 15:52
  • @ardhitama: Even AVX512 has a maximum SIMD element width of 64 bits for add/sub. IDK if it has a 64-bit multiply, but AVX2 doesn't. Anyway, doing four 64-bit adds at once isn't the same thing as doing a full 256-bit add with no breaks in carry propagation. Doing an `adc` chain with vectors of 64-bit elements [is possible, but requires an instruction set designed for it](http://www.agner.org/optimize/blog/read.php?i=421#548). – Peter Cordes May 09 '16 at 18:38
  • Upvoted because it's a reasonable question with an interesting answer, even if the answer is a resounding no :P – Peter Cordes May 09 '16 at 18:53
  • The main reason for upping the integer size in the past has been to increase memory addressability. Post-32-bit addressability is currently needed in some instances, and 64 bits should be large enough for the foreseeable future and well beyond. – zaph May 09 '16 at 20:37
  • These are all great answers! @zaph I hadn't thought of the fact they might have increased to 64 bit just for memory addressing, but I suppose there has to be a pretty good reason to make a big architectural change. So, seeing as we do get by with 64-bit precision or less for almost everything (probably everything really?), and seeing as we're not likely to need more than 18 quintillion bytes of memory any time soon, does that mean we're unlikely to step up to 128-bit architectures any time soon? – Iron Attorney May 10 '16 at 08:44
  • The x86 already has some support for 128-bit integer operations. It can do 64bx64b to 128b scalar multiplication for example. – Z boson May 10 '16 at 12:47
  • The range of fractals for zooming can go well beyond the limits of 256-bit fixed-point arithmetic. – Z boson May 10 '16 at 12:49
  • Isn't 64bx64b just 64b maths which accounts for results in the 128bit range? For 128bit maths, you'd need to account for 256b answers wouldn't you? Tell me more about this fractal zooming business! It sounds intriguing :D – Iron Attorney May 10 '16 at 13:23
  • @IronAttorney, I think 64bx64b to 128b is more than 64b math. See my answer [here](http://stackoverflow.com/questions/34234407/is-there-hardware-support-for-128bit-integers-in-modern-processors/34239917#34239917). It's much more complicated to do 128-bit multiplication if you only have 64bx64b to 64b (lower). In fact 64bx64b to 64b (lower) is not much better than 32bx32b to 64b; it only helps for signed multiplication. In other words, the double-word product of two words (e.g. 64bx64b to 128b) is very useful to calculate the product of two double words (e.g. 128bx128b to 128b (lower)); see the sketch just below these comments. – Z boson May 12 '16 at 11:23
  • You're right actually aren't you, if you're dealing with 128 bit maths, the answer should stay in the 128 bit range shouldn't it. So the 64x64 to 128 is efficient enough at calculating 128 bit maths is it? – Iron Attorney May 22 '16 at 12:27
  • I'll just put this here: http://dec64.com/ – bhspencer Nov 15 '17 at 21:45
  • I've heard rumours of 128-bit architectures being developed. Does anyone have any solid info on that? I have also heard that PowerPC have developed a 512-bit architecture for military use... Is that true? If it is, being military, I can't imagine it being easy to look up, haha. But I also can't see why you'd ever need a 512-bit architecture... Sure, you could process 8 packets of 32-bit data simultaneously, but that doesn't feel like it justifies the complexity of making such a beast. – Iron Attorney Nov 16 '17 at 12:54
  • @bhspencer nice find! I don't entirely understand it yet, by which I mean I don't understand why it claims to be so consistently accurate. I'll have to read that a few more times, I reckon. What are the chances this might end up in C++20, d'you think? – Iron Attorney Nov 16 '17 at 13:00
  • @bhspencer Wait, I get it: its exponent is *10^e rather than *2^e. That's pretty smart. And this is efficient to implement? One note here: if this was scaled up to Dec128... the representable range would be far beyond the reach of a 256-bit fixed point with a binary exponent. Does this mean Dec128 might be the future instead? – Iron Attorney Nov 16 '17 at 13:49
  • @IronAttorney you got it. Uses base 10 rather than base 2. If we are going to go to the trouble of implementing a new ALU in hardware, this seems like a better choice than just making the base-2 float wider. It would be nice if 0.1 + 0.2 == 0.3. – bhspencer Nov 16 '17 at 16:16
  • @IronAttorney Crockford has provided a reference implementation in x86 assembly so it could be included, but software implementations of primitive number types are always slow. We really need a major CPU manufacturer to decide to implement it in their ALU. Reminds me of programming early Android phones that didn't have an FPU. – bhspencer Nov 16 '17 at 16:25
  • @bhspencer I'm in full support of this. ARM seem like the current big thing in CPUs, especially with apple talking about moving more of their products over to them. How can we twist their ARM? – Iron Attorney Nov 16 '17 at 19:05
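
To make Z boson's point about wide multiplies concrete, here is a minimal sketch (assuming the GCC/Clang `unsigned __int128` extension as the 64b x 64b -> 128b primitive) of how the low 128 bits of a 128x128 product fall out of one widening multiply plus two low-half multiplies:

```
#include <cstdint>

// A 128-bit value as two 64-bit limbs (least significant first).
struct u128 { uint64_t lo, hi; };

// Low 128 bits of a 128b x 128b product, built from one 64x64 -> 128 multiply
// plus two ordinary 64x64 -> 64 (low-half) multiplies for the cross terms.
u128 mul128_lo(u128 a, u128 b) {
    unsigned __int128 p = (unsigned __int128)a.lo * b.lo;     // full 64x64 -> 128
    u128 r;
    r.lo = (uint64_t)p;
    r.hi = (uint64_t)(p >> 64) + a.lo * b.hi + a.hi * b.lo;   // cross terms wrap mod 2^64
    return r;
}
```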

2 Answers


SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
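
For illustration, a minimal AVX2 sketch (using the `<immintrin.h>` intrinsics) of the same 256-bit datapath doing independent narrow adds, simply by not propagating carries across element boundaries:

```
#include <immintrin.h>

// Eight independent 32-bit adds in one 256-bit operation.
__m256i add_8x32(__m256i a, __m256i b) { return _mm256_add_epi32(a, b); }

// The same 256 bits of adder hardware doing thirty-two 8-bit adds instead.
__m256i add_32x8(__m256i a, __m256i b) { return _mm256_add_epi8(a, b); }
```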

Hardware multiplier circuits are a lot more expensive to make wider, so it's not safe to assume that a 256b x 256b multiplier will be practical to build.

Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.

So 4B float will continue to be excellent: precise enough to be useful, but small enough to pack many elements into a big vector, or into cache.

Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023, for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
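
A tiny demonstration of that range, using nothing but the standard `<limits>` facilities:

```
#include <cstdio>
#include <limits>

int main() {
    // Smallest positive normal and largest finite binary64 values:
    // about 2^-1022 (~2.2e-308) and just under 2^1024 (~1.8e+308).
    std::printf("min normal double = %.3e\n", std::numeric_limits<double>::min());
    std::printf("max double        = %.3e\n", std::numeric_limits<double>::max());
    std::printf("mantissa width    = %d bits\n", std::numeric_limits<double>::digits);  // 53
}
```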

Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
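
A small sketch of both sides of that trade-off (the values are just illustrative):

```
#include <cstdio>

int main() {
    // Constant *relative* precision: an increment far below the 53-bit
    // mantissa is simply lost, regardless of the magnitude involved.
    std::printf("%.17g\n", 1.0  + 1e-17);   // still 1
    std::printf("%.17g\n", 1e20 + 1.0);     // still 1e+20

    // Catastrophic cancellation: subtracting nearby values leaves mostly noise.
    double a = 1.23456789012345678;         // more digits than a double can hold
    double b = 1.23456789012300000;
    std::printf("%.17g\n", a - b);          // true answer 4.5678e-13; only a few digits survive
}
```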


Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
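
As a concrete sketch of that limit (AVX2 intrinsics from `<immintrin.h>`):

```
#include <immintrin.h>

// AVX2's widest multiply: _mm256_mul_epu32 takes the low 32 bits of each
// 64-bit lane and produces four full 64-bit products (32b x 32b => 64b).
__m256i widening_mul(__m256i a, __m256i b) { return _mm256_mul_epu32(a, b); }

// _mm256_mullo_epi32 multiplies all eight 32-bit elements but keeps only
// the low 32 bits of each product.
__m256i lowhalf_mul(__m256i a, __m256i b) { return _mm256_mullo_epi32(a, b); }
```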

AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).

AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52-bit input precision is clearly chosen so they can use the FP mantissa multiplier hardware, instead of requiring separate 64-bit integer multipliers. (A full vector width of 64-bit full multipliers would be even more expensive than vpmullq. A compromise design like this, even for 64-bit integers, should be a big hint that wide multipliers are expensive.) Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.


Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.


For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
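
For a flavour of the non-SIMD approach, here is a minimal 256-bit add in C++ (a sketch assuming the GCC/Clang `unsigned __int128` extension for the carry; real libraries use hand-tuned `adc` chains):

```
#include <cstdint>

// A 256-bit integer as four 64-bit limbs, least significant first.
struct u256 { uint64_t limb[4]; };

// Full 256-bit add with carry propagated across all four limbs.
u256 add256(u256 a, u256 b) {
    u256 r;
    unsigned __int128 carry = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned __int128 s = (unsigned __int128)a.limb[i] + b.limb[i] + (uint64_t)carry;
        r.limb[i] = (uint64_t)s;   // low 64 bits of this limb's sum
        carry = s >> 64;           // carry-out (0 or 1) into the next limb
    }
    return r;
}
```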

Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where they are used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.

The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.

Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)

Peter Cordes
  • Quick estimate: that multiplier would need about 32k full adders just for the reduction stage, compared to 2k for a 64-bit multiplier. – harold May 09 '16 at 19:10
  • For the record, AVX512 will have 64bit * 64bit to lower 64bit, but it might be quite slow anyway. – Z boson May 10 '16 at 07:46
  • @Zboson: I only found AVX512IFMA. Is there a 64*64 => low64 multiply that I missed? – Peter Cordes May 10 '16 at 08:10
  • Well at least there is an intrinsic `_mm512_mullo_epi64`. According to the intrinsic guide it maps to `vpmullq` in AVX512DQ. – Z boson May 10 '16 at 08:14
  • These are also great answers. So the general consensus is that it would be a lot of effort/expense, with not nearly enough payback. No 128 bit architecture likely in the near future then? I was just trying to look up the dates that 16bit, 32bit and 64bit architectures came about to see if I could work out a relative time scale based on them, but it's hard to get a grasp on any meaningful dates, because the upgrades in architecture happened bit by bit, and in different areas with different companies. – Iron Attorney May 10 '16 at 09:03
  • There have been progressive changes WRT address space. Early on, a 16-bit address space was shared by all processes including the OS. The next step was to provide separate 16-bit virtual address spaces for instructions and data. Next were address spaces per process. Then 20- and 24-bit address spaces, followed by 32-bit address spaces, which we thought at the time was the final solution! Now we are at 64-bit address spaces and this does look like the last final move. (I was personally involved in a 16 to 32 bit OS port.) – zaph May 10 '16 at 12:06
  • @Zboson: Thanks, it was hiding in the same entry as `pmulld` in the intel future extensions pdf. I was expecting a `vpmul*` name, since I knew it had to be a new instruction. – Peter Cordes May 10 '16 at 18:37

128-bit computers. It is also about addressing memory, and when we run out of 64 bits for doing so. Currently there are servers with 4 TB of memory, which requires about 42 bits of address space (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so at least 2 * 22 = 44 years, and memory prices are likely not dropping that fast, so it will be more than 50 years before we run out of 64-bit addressing capability.
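
The same estimate as a back-of-the-envelope calculation (a sketch; the 4 TB figure and the price-halving-every-two-years assumption are taken from the answer above):

```
#include <cmath>
#include <cstdio>

int main() {
    double bytes_today = 4e12;                                      // ~4 TB servers today
    int bits_needed    = (int)std::ceil(std::log2(bytes_today));    // 42
    int bits_left      = 64 - bits_needed;                          // 22
    int years_per_bit  = 2;                                         // price halves every 2 years
    std::printf("bits needed: %d, headroom: %d bits ~ %d+ years\n",
                bits_needed, bits_left, bits_left * years_per_bit);
}
```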

Nuutti
  • Interesting point, maybe we'll have 80-bit, 96-bit or 128-bit pointers at some point. We'll probably still use 32-bit `int` for many things, though. This feels like an answer to a different (but related) question. – Peter Cordes Nov 15 '17 at 21:53
  • Interesting angle, and you're right. We're not likely to max out 64 bit memory addressing for a long time. – Iron Attorney Nov 16 '17 at 12:54
  • sizeof(virtual_memory) != sizeof(physical_memory). Disk and flash drives are also getting larger and cheaper. – jwdonahue Jun 30 '18 at 16:27