What is the fastest way to perform hardware division of an integer by a fixed constant?

Question

I have a 16 bit number which I want to divide by 100. Let's say it's 50000. The goal is to obtain 500. However, I am trying to avoid inferred dividers on my FPGA because they break timing requirements. The result does not have to be accurate; an approximation will do.

I have tried hardware multiplication by 0.01 but real numbers are not supported. I'm looking at pipelined dividers now but I hope it does not come to that.

[Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935/995714) — phuclv, Mar 23 '19 at 16:02

Doug Currie · Accepted Answer · 2014-07-16T20:29:58.357

14

Conceptually: Multiply by 655 (= 65536/100) and then shift right by 16 bits. Of course, in hardware, the shift right is free.

If you need it to be even faster, you can hardwire the divide as a sum of divisions by powers of two (shifts). E.g.,

1/100 ~= 1/128                  = 0.0078125
1/100 ~= 1/128 + 1/256          = 0.01171875
1/100 ~= 1/128 + 1/512          = 0.009765625
1/100 ~= 1/128 + 1/512 + 1/2048 = 0.01025390625
1/100 ~= 1/128 + 1/512 + 1/4096 = 0.010009765625
etc.

In C code the last example above would be:

uint16_t divideBy100 (uint16_t input)
{
    return (input >> 7) + (input >> 9) + (input >> 12);
}

edited Jul 16 '14 at 20:29

answered Jul 16 '14 at 16:43

Doug Currie

40,708
1
95
119

2

Since 65536/100 is not an exact value, you need to do some analysis to make sure the result is within your margin of error. More bits may help, or maybe you can get by with fewer. – Mark Ransom Jul 16 '14 at 16:53
Are you implying implementing a shift-add architecture with "divide as a sum of divisions by powers of two"? it is common to implement that over multiple clock cycles. – Morgan Jul 16 '14 at 20:18
@Morgan, sure you can use pipelining and do one add per cycle, or you can use up some real estate to make a 3 input adder and do it in one cycle. – Doug Currie Jul 16 '14 at 20:51
3

@Morgan actually divide first then add is hopelessly inaccurate - for the input range of 0-65535, 36016 are off by 1 (both directions) and 5580 are off by 2! If you change the expression to `((n<<5)+(n<<3)+n)>>12` you get 20640 off by 1 errors, always high. It's the equivalent of `n*41/4096`. – Mark Ransom Jul 17 '14 at 15:58
@Mark, yes, despite the C code, in my head I was picturing an adder that extended up to 12 bits to the right of the "decimal point." I.e., a (16 - 7 + 12 =) 21 bit adder. Your expression is a good way to express that since it shows clearly that we need at least 16 + 5 bits for full accuracy. Thanks for the analysis. Note that the original question includes "The result does not have to be accurate; an approximation will do." – Doug Currie Jul 17 '14 at 20:25
You will also need access to the carry bit on the output, making a total of 22 bits. But I've done a bit more experimenting and found that an expression of `(n+(n>>2)+(n>>5))>>7` is even more accurate with only 20448 off by 1 errors, and it only requires 17 bits! – Mark Ransom Jul 17 '14 at 20:33
Note that 0.01 ~= (0.000000101001)b, so 7th, 9th and 12th bits are open. – sikerbela Jun 18 '18 at 02:47

score 5 · Answer 2 · answered Jul 16 '14 at 18:24

Assuming that

the integer division is intended to truncate, not round (e.g. 599 / 100 = 5)
it's ok to have a 16x16 multiplier in the FPGA (with a fixed value on one input)

then you can get exact values by implementing a 16x16 unsigned multiplier where one input is 0xA3D7 and the other input is your 16-bit number. Add 0x8000 to the 32-bit product, and your result is in the upper 10 bits.

In C code, the algorithm looks like this

uint16_t divideBy100( uint16_t input )
{
    uint32_t temp;

    temp = input;
    temp *= 0xA3D7;     // compute the 32-bit product of two 16-bit unsigned numbers
    temp += 0x8000;     // adjust the 32-bit product since 0xA3D7 is actually a little low
    temp >>= 22;        // the upper 10-bits are the answer

    return( (uint16_t)temp );
}

score 1 · Answer 3 · answered Jul 16 '14 at 18:37

1

Generally, you can multiply by the inverse and shift. Compilers do this all the time, even for software.
Here is a page that does that for you: http://www.hackersdelight.org/magic.htm
In your case that seems to be multiplication by 0x431BDE83, followed by a right-shift of 17.

And here is an explanation: Computing the Multiplicative Inverse for Optimizing Integer Division

answered Jul 16 '14 at 18:37

user541686

205,094
128
528
886

This might work well for assembly level optimisation, but the OPs 'timing' refers to propagation delay through the logic. Multiplication by the numbers expressed here would result in the requirement for a large multiplier, 32x32 which might not be available. – Morgan Jul 17 '14 at 09:37
@Morgan: I understand, but I intended this to be a level of simplification toward that goal. Multiplication logic is a lot easier to approximate and optimize than division logic, so if the OP wants to trade off speed for accuracy, it's a matter of manipulating the constant and the right-shift to fit that demand. – user541686 Jul 17 '14 at 09:46

score 0 · Answer 4 · answered Jul 16 '14 at 18:55

Multiplying by the reciprocal is often a good approach, as you have noted though real numbers are not supported. You need to work with fixed point rather than floating point reals.

Verilog does not have a definition of fixed point, but it it just uses a word length and you decide how many bits are integer and how many fractional.

0.01 (0.0098876953125) in binary would be 0_0000001010001. The bigger this word length the greater the precision.

// 1Int, 13Frac
wire ONE_HUNDREDTH = 14'b0_0000001010001 ; 

input  a         [15:0];    //Integer (no fractional bits)
output result [15+14:0];    //13 fractional bits inherited form ONE_HUNDREDTH
output result_int [15:0];   //Integer result

always @* begin
  result     = ONE_HUNDREDTH * a;
  result_int = result >>> 13;
end

Real to binary conversion done using the ruby gem fixed_point.

A ruby irb session (with fixed_point installed via gem install fixed_point):

require 'fixed_point'
#Unsigned, 1 Integer bit, 13 fractional bits
format  = FixedPoint::Format.new(0, 1, 13)
fix_num = FixedPoint::Number.new(0.01, format )
 => 0.0098876953125
fix_num.to_b
 => "0.0000001010001"

What is the fastest way to perform hardware division of an integer by a fixed constant?

4 Answers4