7

I am working on a fixed-point platform (floating-point arithmetic not supported).

I represent any rational number q as the floor value of q * (1 << precision).

I need an efficient method for calculating log base 2 of x, where 1 < x < 2.

Here is what I've done so far:

uint64_t Log2(uint64_t x, uint8_t precision)
{
    uint64 res = 0;

    uint64 one = (uint64_t)1 << precision;
    uint64 two = (uint64_t)2 << precision;

    for (uint8_t i = precision; i > 0 ; i--)
    {
        x = (x * x) / one; // now 1 < x < 4
        if (x >= two)
        {
            x >>= 1; // now 1 < x < 2
            res += (uint64_t)1 << (i - 1);
        }
    }

    return res;
}

This works well, however, it takes a toll on the overall performance of my program, which requires executing this for a large amount of input values.

For all it matters, the precision used is 31, but this may change so I need to keep it as a variable.

Are there any optimizations that I can apply here?

I was thinking of something in the form of "multiply first, sum up last".

But that would imply calculating x ^ (2 ^ precision), which would very quickly overflow.

Update

I have previously tried to get rid of the branch, but it just made things worse:

for (uint8_t i = precision; i > 0 ; i--)
{
    x = (x * x) / one; // now 1 < x < 4
    uint64_t n = x / two;
    x >>= n; // now 1 < x < 2
    res += n << (i - 1);
}

return res;
halfer
  • 19,824
  • 17
  • 99
  • 186
goodvibration
  • 5,980
  • 4
  • 28
  • 61
  • 1
    Duplicate? https://stackoverflow.com/questions/4657468/fast-fixed-point-pow-log-exp-and-sqrt – GManNickG Aug 03 '17 at 22:32
  • Why the division by `one`? Isn't `x = (x * x) >> precision` equivalent? – rodrigo Aug 03 '17 at 22:34
  • @GManNickG: Not really. I see stuff similar to what I wrote, but no suggestion on how to optimize that loop (may that's just not possible, but I'd still like to read some opinions about it). – goodvibration Aug 03 '17 at 22:35
  • @rodrigo: Yes. Don't worry about performance of division vs shift-right. On my underlying platform, they are equivalent. – goodvibration Aug 03 '17 at 22:36
  • If precision is 32 and `x` is between 1 and 2, then the fixed point representation of `x` has 33 bits. Doing `(x * x)` will result in 66 bits overflowing the `uint64_t` type. – rodrigo Aug 03 '17 at 22:57
  • 1
    @rodrigo: Sorry, you got me. The actual platform supports `uint256_t`, but I did not want to involve this here (and start getting unrelated questions about the platform). So feel free to reconsider `precision=31`, and I will fix it up in the question. Thank you. – goodvibration Aug 03 '17 at 23:05
  • Enough memory for a look-up table? – deamentiaemundi Aug 03 '17 at 23:22
  • 1
    What is this platform? Are you actually synthesizing this for an FPGA or something like that? – harold Aug 04 '17 at 00:08

2 Answers2

1

The only things I can think of is to do the loop with a right-shift instead of a decrement and change a few operations to their equivalent binary ops. That may or may not be relevant to your platform, but in my x64 PC they yield an improvement of about 2%:

uint64_t Log2(uint64_t x, uint8_t precision)
{
    uint64_t res = 0;
    uint64_t two = (uint64_t)2 << precision;

    for (uint64_t b = (uint64_t)1 << (precision - 1); b; b >>= 1)
    {
        x = (x * x) >> precision; // now 1 < x < 4
        if (x & two)
        {
            x >>= 1; // now 1 < x < 2
            res |= b;
        }
    }
    return res;
}
rodrigo
  • 94,151
  • 12
  • 143
  • 190
1

My proposal would go from opposite direction -- into a use of a constant-performance at fixed number of steps.

Given a reasonable small amount of resources will still suffice and the precision target is known and always reached, the constant-performance deployment can beat most iterative schemes.

A Taylor expansion ( since 1715 ) of log2(x) provides both a solid calculus basement plus (almost) infinite precision a-priori known to be feasible for any depth of fixed-point arithmetics ( be it for Epiphany / FPGA / ASIC / you keep it private / ... )

Math transforms the whole problem into an optionally small amount of a few node points X_tab_i, for which ( as few as platform precision requires ) constants are pre-calculated for each node point. The rest is a platform-efficient assembly of Taylor sum of products, granting the result is obtained both in constant-time + having a residual error under design-driven threshold ( the target PSPACE x PTIME constraints tradeoff here is obvious for design phase, yet the process is always a CTIME, CSPACE once deployed )

Voilá:

Given X: lookup closest  X_tab_i,
                with    C0_tab_i, C1_tab_i, C2_tab_i, .., Cn_tab_i
//-----------------------------------------------------------------<STATIC/CONST>
//            ![i]
#DEFINE  C0_tab_i  <log2( X_tab_i )>
#DEFINE  C1_tab_i  <    ( X_tab_i )^(-1) * ( +1         / ( 1 * ln(2) )>
#DEFINE  C2_tab_i  <    ( X_tab_i )^(-2) * ( -1         / ( 2 * ln(2) )>
#DEFINE  C3_tab_i  <    ( X_tab_i )^(-3) * ( +1         / ( 3 * ln(2) )>
:::       :                           :                     :
#DEFINE  CN_tab_i  <    ( X_tab_i )^(-N) * ( -1^(N-1) ) / ( N * ln(2) )>

// -----------------------------------------------------------------<PROCESS>-BEG
DIFF  = X - X_tab_i;     CORR  = DIFF;
RES   = C0_tab_i
      + C1_tab_i * CORR; CORR *= DIFF;
RES  += C2_tab_i * CORR; CORR *= DIFF;
...  +=
RES  += Cn_tab_i * CORR; CORR *= DIFF;
// --------------------------------------------------------------<PROCESS>-END:
user3666197
  • 1
  • 6
  • 50
  • 92