
I am implementing a Newton-Raphson division algorithm for values in [-16, 16] in S15.16 fixed point on an FPGA. For magnitudes in [1, 16] I achieved an MSE of 10e-9 with 3 iterations. I initialize the value a0 as the reciprocal of the midpoint of each unit range.

Some examples are:

  • From range 1 <= x < 2: a0 = 2/3
  • From range 6 <= x < 7: a0 = 2/13
  • From range 15 <= x < 16: a0 = 2/31
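As a sanity check on this initialization (a float sketch, not the fixed-point code; the function name is mine): with a0 = 1/(n + 0.5) on [n, n+1), the relative error e = 1 - a0*x starts at |e| <= 1/(2n+1) and squares on every iteration, so three iterations leave at most (1/3)^8 ≈ 1.5e-4 relative error, with the worst case in [1, 2):

```python
import math

def recip_nr(x, iters=3):
    # reciprocal of the midpoint of the unit range containing x
    n = math.floor(x)
    a = 1.0 / (n + 0.5)
    # Newton-Raphson refinement: a <- a*(2 - a*x) squares the relative error
    for _ in range(iters):
        a = a * (2.0 - a * x)
    return a

for x in (1.3, 6.9, 15.99):
    assert abs(recip_nr(x) * x - 1.0) <= (1.0 / 3.0) ** 8
```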

This approximation works well, as can be seen in the following plot:

[Plot: theoretical division vs. Newton-Raphson division]

So the problem is in the range [0, 1). How can I find the optimal initial value, or a good approximation to it, for that range?

The Wikipedia article on division algorithms says:

For the subproblem of choosing an initial estimate X0, it is convenient to apply a bit-shift to the divisor D to scale it so that 0.5 ≤ D ≤ 1; by applying the same bit-shift to the numerator N, one ensures the quotient does not change. Then one could use a linear approximation in the form

X0 = T1 + T2·D

to initialize Newton–Raphson. To minimize the maximum of the absolute value of the error of this approximation on the interval [0.5, 1], one should use

X0 = 48/17 − (32/17)·D

OK, this approximation works well for the range [0.5, 1], but:

  1. What happens when the value gets smaller, approaching 0 (e.g. 0.1, 0.01, 0.001, 0.00001, and so on)? I see a problem here, because I think an initial value would be needed for each range: [0.001, 0.01], [0.01, 0.1], etc.
  2. For such small values, is it better to apply another algorithm, such as the Goldschmidt division algorithm?
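To answer question 1 concretely: after the bit-shift normalization, a single seed covers every magnitude, so no per-decade table is needed. A float sketch (function name is mine; `math.log2` stands in for the hardware count-leading-zeros plus shift):

```python
import math

# Sketch in Python floats: normalize the divisor into [0.5, 1) with a shift,
# apply the same shift to the dividend, and seed Newton-Raphson with the
# minimax linear approximation X0 = 48/17 - (32/17)*D quoted from Wikipedia.
def divide_nr(n, d, iters=3):
    e = math.floor(math.log2(abs(d))) + 1   # hardware: CLZ / priority encoder
    dn = d / 2.0**e                         # dn in [0.5, 1)
    nn = n / 2.0**e                         # same shift keeps n/d unchanged
    x = 48.0/17.0 - (32.0/17.0) * dn        # minimax seed on [0.5, 1]
    for _ in range(iters):
        x = x * (2.0 - dn * x)              # Newton-Raphson for 1/dn
    return x * nn                           # (1/dn) * nn == n / d

assert abs(divide_nr(1.0, 0.001) - 1000.0) < 1e-6   # tiny divisor, no LUT
```

With this seed the worst-case relative error of the raw reciprocal after three iterations is about (1/17)^8 ≈ 1.4e-10, independent of the divisor's magnitude. This also bears on question 2: Goldschmidt needs the same normalization and seed, so it is an alternative iteration scheme rather than a fix for small operands.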

This is the code I have implemented to emulate Newton-Raphson division in fixed point:

import numpy as np

i = 0
# 16 fractional bits
SHIFT = 2 ** 16
# LUT with "optimal?" values to initialize the NRD algorithm
LUT = np.round(1 / np.arange(0.5, 16, 1) * SHIFT).astype(np.int64)
LUT_f = 1 / np.arange(0.5, 16, 1)
# Function that simulates the NRD algorithm in S15.16 on an FPGA
def FIXED_RECIPROCAL(x):
    # Address of the initial value: the integer part of x
    address = x >> 16
    # Get the value from the LUT
    a0 = LUT[address]
    # Algorithm with only 3 iterations: a0 <- 2*a0 - x*a0^2
    for _ in range(3):
        s1 = (a0*a0) >> 16
        s2 = (x*s1) >> 16
        a0 = (a0 << 1) - s2
    # Return rescaled value (only for analysis purposes)
    return a0 / SHIFT

# ----- TEST ----- #
t = np.arange(1, 16, 0.1)
teor = 1 / t
t_fixed = (t * SHIFT).astype(np.int32)
prac = np.zeros(len(t))
for i, value in enumerate(t_fixed):
    prac[i] = FIXED_RECIPROCAL(value)

# Get and print Errors
errors = abs(prac - teor)
mse = ((prac - teor)**2).mean(axis=None)

print("Max Error : %s" % ('{:.3E}'.format(np.max(errors))))
print("MSE:      : %s" % ('{:.3E}'.format(mse)))

# Print the obtained values:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot(t, teor, label='Theoretical division')
plt.plot(t, prac, '.', label='Newton-Raphson Division')
plt.legend(fontsize=16)
plt.title('Float32 theoretical division vs. S15.16 Newton-Raphson division', fontsize=22)
plt.xlabel('x', fontsize=20)
plt.ylabel('1 / x', fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
Diego Ruiz
  • How wide a multiplier can you afford? For a reasonably straightforward NR-based S15.16 division you will need a 32x32->64 bit multiplier, as up to 31 correct quotient bits need to be generated. With smaller multipliers, a high-radix long-hand division might be a good alternative. Generally, as Wikipedia states, you will have to normalize the divisor via "count leading zeros" (a priority encoder) followed by a barrel shifter. – njuffa Feb 04 '21 at 23:41
  • What's the accuracy target for this division? – njuffa Feb 05 '21 at 01:40
  • Hi @njuffa. My goal is to implement a set of activation functions for embedded neural networks on an FPGA: sigmoid, hyperbolic tangent, and exponential. At the moment I have implemented the sigmoid and tanh with CRI approximations and the exponential via LUT, and they work well. But I want to experiment with implementing the exponential via the hyperbolic tangent, so the necessary accuracy should be high, on the order of E-4 or better, because 1-tanh(x) is close to zero when x is greater than about 4. I think it is not a good idea to try to implement the exponential via the hyperbolic tangent... – Diego Ruiz Feb 05 '21 at 07:41
  • For the CRI approximation of the sigmoid and tanh I have followed these papers: https://www.researchgate.net/publication/228618304_Digital_Implementation_of_The_Sigmoid_Function_for_FPGA_Circuits and https://core.ac.uk/download/pdf/187739765.pdf. And I have implemented the exponential function this way (this is not the final version...): https://stackoverflow.com/questions/65806509/implement-exponential-function-via-lut – Diego Ruiz Feb 05 '21 at 07:45
  • Implementing `exp` via `tanh` seems backwards. `tanh` can be computed from `exp2` with a reciprocal and a few multiplies. For the computation of `exp2` one can use bitwise computation, table interpolation, or polynomial approximation, as I demonstrated in [this answer](https://stackoverflow.com/questions/64941736/efficient-implementation-of-fixed-point-power-pow-function-with-argument-const/64978770#64978770) – njuffa Feb 08 '21 at 21:37
  • Hi @njuffa. Thanks for your answer. I have checked your Newton-Raphson algorithm and it works perfectly, but I have finally decided to implement the exponential function via table interpolation with a ROM of 2¹⁰ x 20 bits for an Intel (Altera) FPGA. This way I only need one multiplier and the ROMs :-) At the moment I have implemented it only for values from 0 to 4: first I look up in the ROM the value from 0 to 1 belonging to the fractional part, and then I obtain the value of the integer part with a simple if statement (a multiplexer). I think this is the best way for my app. – Diego Ruiz Feb 14 '21 at 20:18

1 Answer


The ISO-C99 code below demonstrates how one can achieve an almost correctly rounded implementation of Newton-Raphson based s15.16 division, using a 2-kilobit lookup table for the initial reciprocal approximation and 32x32-bit multipliers capable of delivering both the low 32 bits and the high 32 bits of the full product. For ease of implementation, the signed s15.16 division is mapped to an unsigned 16.16 division with operands in [0, 2³¹].
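The baseline contents of that 2-kilobit table (256 entries x 8 stored bits, plus an implicit ninth bit) can be regenerated in Python from the formula quoted in the code comment below; note the answer states the stored table was subsequently hand-tuned, so a few entries differ from this baseline.

```python
# Regenerate the baseline 256-entry reciprocal seed table from the formula in
# the code comment: (int)(1.0/(1.0 + 1.0/512.0 + i/256.0) * 512 + .5) & 0xff.
# The implicit ninth bit (0x100) is OR-ed back in at lookup time.
rcp_tab_baseline = [
    int(1.0 / (1.0 + 1.0/512.0 + i/256.0) * 512 + 0.5) & 0xFF
    for i in range(256)
]

assert len(rcp_tab_baseline) == 256
# the first entries match the stored table (0xff, 0xfd, 0xfb, ...)
assert rcp_tab_baseline[:3] == [0xFF, 0xFD, 0xFB]
```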

We need to make full use of the 32-bit data path throughout by keeping operands normalized. In essence we are converting the computation into a quasi floating-point format. This requires a priority encoder to find the most significant set bit in both the dividend and the divisor in the initial normalization step. For software convenience, this is mapped to a CLZ (count leading zeros) operation, present in many processors, in the code below.
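A small Python model of this normalization step (helper names are mine), with `int.bit_length()` standing in for the CLZ unit:

```python
# Python model of the normalization step; int.bit_length() plays the role of
# the priority encoder (clz(a) = 32 - a.bit_length() for 32-bit operands).
def clz32(a):
    return 32 - a.bit_length() if a else 32

def normalize32(a):
    # shift a left until bit 31 is set, i.e. a becomes a 1.31 value in [1, 2)
    lz = clz32(a)
    return (a << lz) & 0xFFFFFFFF, lz

na, lza = normalize32(0x00010000)   # 1.0 in 16.16
assert (na, lza) == (0x80000000, 15)
```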

After computing the reciprocal of the divisor b, we multiply by the dividend a to determine the raw quotient q = (1/b)*a. In order to round correctly to nearest-or-even we need to compute the remainder for the quotient as well as for its increment and decrement. The correctly rounded quotient corresponds to the candidate quotient whose remainder has the smallest magnitude.
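That candidate-selection step can be sketched with Python integers (the helper name is mine):

```python
# Sketch of the rounding step: given a raw quotient q for (a << 16) / b that
# may be off by one, compare the remainder magnitudes of q-1, q, q+1 and keep
# the candidate whose remainder is smallest.
def round_quotient(a, b, q):
    ah = a << 16                      # dividend scaled to align with q*b
    cands = [c for c in (q - 1, q, q + 1) if c >= 0]
    return min(cands, key=lambda c: abs(ah - c * b))

# example: 100.0 / 3.0 in 16.16; the correctly rounded quotient is 2184533
a, b = 100 << 16, 3 << 16
assert round_quotient(a, b, 2184532) == 2184533   # raw quotient one too low
assert round_quotient(a, b, 2184534) == 2184533   # raw quotient one too high
```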

In order for this to work perfectly, we would need a raw quotient that is within 1 ulp of the mathematical result. Unfortunately, this is not the case here, since the raw quotient is occasionally off by ±2 ulps. We would need effectively 33 bits in some of the intermediate computation, which could be simulated in software but I don't have time to puzzle this out right now. The code "as-is" delivers the correctly rounded result in more than 99.999% of random test cases.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TAB_BITS_IN   (8) /* 256 entry LUT */
#define TAB_BITS_OUT  (9) /* 9 bits effective, 8 bits stored */
#define TRUNC_COMP    (1) /* compensate truncation in fixed-point multiply */

int clz (uint32_t a);  // count leading zeros: a priority encoder
uint32_t umul32_hi (uint32_t a, uint32_t b); // upper half of 32x32-bit product

/* i in [0,255]: (int)(1.0 / (1.0 + 1.0/512.0 + i / 256.0) * 512 + .5) & 0xff 
   In a second step tuned to minimize the number of incorrect results with the
   specific implementation of the two refinement steps chosen.
*/
static uint8_t rcp_tab[256] = 
{
    0xff, 0xfd, 0xfb, 0xf9, 0xf7, 0xf5, 0xf3, 0xf1,
    0xf0, 0xee, 0xec, 0xea, 0xe8, 0xe6, 0xe5, 0xe3,
    0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
    0xd3, 0xd2, 0xd0, 0xce, 0xcd, 0xcb, 0xc9, 0xc8,
    0xc6, 0xc5, 0xc3, 0xc2, 0xc0, 0xbf, 0xbd, 0xbc,
    0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb1, 0xb0,
    0xae, 0xad, 0xac, 0xaa, 0xa9, 0xa7, 0xa6, 0xa5,
    0xa4, 0xa2, 0xa1, 0x9f, 0x9e, 0x9d, 0x9c, 0x9a,
    0x99, 0x98, 0x96, 0x95, 0x94, 0x93, 0x91, 0x90,
    0x8f, 0x8e, 0x8c, 0x8b, 0x8a, 0x89, 0x88, 0x87,
    0x86, 0x84, 0x83, 0x82, 0x81, 0x80, 0x7f, 0x7e,
    0x7c, 0x7b, 0x7a, 0x79, 0x78, 0x77, 0x76, 0x74,
    0x74, 0x73, 0x71, 0x71, 0x70, 0x6f, 0x6e, 0x6d,
    0x6b, 0x6b, 0x6a, 0x68, 0x67, 0x67, 0x66, 0x65,
    0x64, 0x63, 0x62, 0x61, 0x60, 0x5f, 0x5e, 0x5d,
    0x5c, 0x5b, 0x5b, 0x59, 0x58, 0x58, 0x56, 0x56,
    0x55, 0x54, 0x53, 0x52, 0x51, 0x51, 0x50, 0x4f,
    0x4e, 0x4e, 0x4c, 0x4b, 0x4b, 0x4a, 0x48, 0x48,
    0x48, 0x46, 0x46, 0x45, 0x44, 0x43, 0x43, 0x42,
    0x41, 0x40, 0x3f, 0x3f, 0x3e, 0x3d, 0x3c, 0x3b,
    0x3b, 0x3a, 0x39, 0x38, 0x38, 0x37, 0x36, 0x36,
    0x35, 0x34, 0x34, 0x33, 0x32, 0x31, 0x30, 0x30,
    0x2f, 0x2e, 0x2e, 0x2d, 0x2d, 0x2c, 0x2b, 0x2a,
    0x2a, 0x29, 0x28, 0x27, 0x27, 0x26, 0x26, 0x25,
    0x24, 0x23, 0x23, 0x22, 0x21, 0x21, 0x21, 0x20,
    0x1f, 0x1f, 0x1e, 0x1d, 0x1d, 0x1c, 0x1c, 0x1b,
    0x1a, 0x19, 0x19, 0x19, 0x18, 0x17, 0x17, 0x16,
    0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11,
    0x11, 0x10, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0d,
    0x0c, 0x0c, 0x0b, 0x0b, 0x0a, 0x0a, 0x09, 0x08,
    0x08, 0x07, 0x07, 0x07, 0x06, 0x05, 0x05, 0x04,
    0x04, 0x03, 0x03, 0x02, 0x02, 0x01, 0x01, 0x01
};

/* Divide two u16.16 fixed-point operands each in [0, 2**31]. Attempt to round
   the result to nearest-or-even. Currently this does not always succeed: we
   would need effectively 33 bits in the intermediate computation so that the
   raw quotient is within +/- 1 ulp of the mathematical result.
*/
uint32_t div_core (uint32_t a, uint32_t b)
{
    /* normalize dividend and divisor to [1,2); bit 31 is the integer bit */
    uint8_t lza = clz (a);
    uint8_t lzb = clz (b);
    uint32_t na = a << lza;
    uint32_t nb = b << lzb;
    /* LUT is addressed by most significant fraction bits of divisor */
    uint32_t idx = (nb >> (32 - 1 - TAB_BITS_IN)) & 0xff;
    uint32_t rcp = rcp_tab [idx] | 0x100; // add implicit msb
    /* first NR iteration */
    uint32_t f = (rcp * rcp) << (32 - 2*TAB_BITS_OUT);
    uint32_t p = umul32_hi (f, nb);
    rcp = (rcp << (32 - TAB_BITS_OUT)) - p;
    /* second NR iteration */
    rcp = rcp << 1;
    p = umul32_hi (rcp, nb);
    rcp = umul32_hi (rcp, 0 - p);
    /* compute raw quotient as (1/b)*a; off by at most +/- 2ulps */
    rcp = (rcp << 1) | TRUNC_COMP;
    uint32_t quot = umul32_hi (rcp, na);
    uint8_t shift = lza - lzb + 15;
    quot = (shift > 31) ? 0 : (quot >> shift);
    /* round quotient using 4:1 mux */
    uint32_t ah = a << 16;
    uint32_t prod = quot * b;
    uint32_t rem1 = abs (ah - prod);
    uint32_t rem2 = abs (ah - prod - b);
    uint32_t rem3 = abs (ah - prod + b);
    int sel = (((rem2 < rem1) << 1) | ((rem3 < rem1) & (quot != 0)));
    switch (sel) {
    case 0:
    default:
        quot = quot;
        break;
    case 1: 
        quot = quot - 1;
        break;
    case 2: /* fall through */
    case 3: 
        quot = quot + 1;
        break;
    }
    return quot;
}

int32_t div_s15p16 (int32_t a, int32_t b)
{
    uint32_t aa = abs (a);
    uint32_t ab = abs (b);
    uint32_t quot = div_core (aa, ab);
    quot = ((a ^ b) & 0x80000000) ? (0 - quot) : quot;
    return (int32_t)quot;
}

uint64_t umul32_wide (uint32_t a, uint32_t b)
{
    return ((uint64_t)a) * b;
}

uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(umul32_wide (a, b) >> 32);
}

#define VARIANT  (1)
int clz (uint32_t a)
{
#if VARIANT == 1
    static const uint8_t clz_tab[32] = {
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
    };
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acddu * a >> 27] + (!a);
#elif VARIANT == 2
    uint32_t b;
    int n = 31 + (!a);
    if ((b = (a & 0xffff0000u))) { n -= 16;  a = b; }
    if ((b = (a & 0xff00ff00u))) { n -=  8;  a = b; }
    if ((b = (a & 0xf0f0f0f0u))) { n -=  4;  a = b; }
    if ((b = (a & 0xccccccccu))) { n -=  2;  a = b; }
    if ((    (a & 0xaaaaaaaau))) { n -=  1;         }
    return n;
#elif VARIANT == 3
    int n = 0;
    if (!(a & 0xffff0000u)) { n |= 16;  a <<= 16; }
    if (!(a & 0xff000000u)) { n |=  8;  a <<=  8; }
    if (!(a & 0xf0000000u)) { n |=  4;  a <<=  4; }
    if (!(a & 0xc0000000u)) { n |=  2;  a <<=  2; }
    if ((int32_t)a >= 0) n++;
    if ((int32_t)a == 0) n++;
    return n;
#elif VARIANT == 4
    uint32_t b;
    int n = 32;
    if ((b = (a >> 16))) { n = n - 16;  a = b; }
    if ((b = (a >>  8))) { n = n -  8;  a = b; }
    if ((b = (a >>  4))) { n = n -  4;  a = b; }
    if ((b = (a >>  2))) { n = n -  2;  a = b; }
    if ((b = (a >>  1))) return n - 2;
    return n - a;
#endif
}

uint32_t div_core_ref (uint32_t a, uint32_t b)
{
    int64_t quot = ((int64_t)a << 16) / b;
    /* round to nearest or even */
    int64_t rem1 = ((int64_t)a << 16) - quot * b;
    int64_t rem2 = rem1 - b;
    if (llabs (rem2) < llabs (rem1)) quot++;
    if ((llabs (rem2) == llabs (rem1)) && (quot & 1)) quot &= ~1;
    return (uint32_t)quot;
}

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    uint64_t count = 0ULL, stats[3] = {0ULL, 0ULL, 0ULL};
    uint32_t a, b, res, ref;
    do {
        /* random dividend and divisor, avoiding overflow and division by zero */
        do {
            a = KISS % 0x80000001u;
            b = KISS % 0x80000001u;
        } while ((b == 0) || ((((uint64_t)a << 16) / b) > 0x80000000ULL));

        /* compute function under test and reference result */
        ref = div_core_ref (a, b);
        res = div_core (a, b);

        if (llabs ((int64_t)res - (int64_t)ref) > 1) {
            printf ("\nerror: a=%08x b=%08x res=%08x ref=%08x\n", a, b, res, ref);
            break;
        } else {
            stats[(int64_t)res - (int64_t)ref + 1]++;
        }
        count++;
        if (!(count & 0xffffff)) {
            printf ("\r[-1]=%llu  [0]=%llu  [+1]=%llu", stats[0], stats[1], stats[2]);
        }
    } while (count);
    return EXIT_SUCCESS;
}
njuffa
  • Hi @njuffa, thanks. I will study the algorithm this weekend and try to see how to optimize it for hardware purposes! – Diego Ruiz Feb 05 '21 at 10:20
  • @DiegoRuiz A lot of modern FPGAs have ready-to-use multiplier blocks and configurable ROM resources, so it should not be too hard to get this implemented. The rest of the data path is adders, comparators, multiplexors, the priority encoder, barrel shifters, and some wire shifts :-) The last time I designed for an FPGA was in 1992, the last time I worked on processor hardware was in 1999. If your FPGA is very minimal, that would be worth mentioning in your questions, because bit-wise processing would likely be a better match in that case. – njuffa Feb 05 '21 at 10:28
  • The FPGA is a Cyclone V series, which has between 55k and 110k ALMs, so there is no problem with the algorithms for activation functions or division. The really heavy part of the application is the *matrix multiplication kernel*, which uses a *BLOCK SIZE* strategy to vectorize the matrix multiplication. So the problem is the replication of logic such as division across the BLOCK SIZE architecture. This is why I try to optimize the hardware resources as much as possible. On the other hand, NNs usually work with small numeric ranges, and I think I don't need the full numeric range. – Diego Ruiz Feb 05 '21 at 10:51