
I'm in the process of optimizing the encoding step of a C++ library called PackJPG.

I've profiled the code with Intel VTune and found that the current bottleneck is the following function in the arithmetic coder that PackJPG uses:

void aricoder::encode( symbol* s )
{   
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh - clow) + 1);
    cstep = delta_plus_one / s->scale;
    chigh = clow + ( cstep * s->high_count ) - 1;
    clow  = clow + ( cstep * s->low_count );

    // e3 scaling is performed for speed and to avoid underflow
    // if low and high are both in the lower half or both in the upper half,
    // one bit can safely be shifted out
    while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
        if ( chigh < CODER_LIMIT050 ) { // both high and low are below the midpoint, so 0 can safely be shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow  &= CODER_LIMIT050 - 1;
            chigh &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits

            write_nrbits_as_zeros();
        }
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
        ++nrbits;
        clow  &= CODER_LIMIT025 - 1;
        chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }
}

This function seems to borrow some ideas from: http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf. I've managed to optimize the function somewhat (primarily by speeding up the bit writing) but now I'm stuck.

Right now the biggest bottleneck seems to be the division at the beginning. This screenshot from VTune shows the timing results as well as the generated assembly (the blue assembly on the right corresponds to the source line selected on the left).

s->scale is not necessarily an even power of 2, so the division can't simply be replaced with a bit shift.

The code was compiled with MSVC (from Visual Studio 2013) with the following settings:

/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch" 

Any ideas on how to optimize this further?

UPDATE 1 I've now tried all the suggestions so far, and this is the fastest version:

void aricoder::encode( symbol* s )
{   
    unsigned int clow_copy = clow;
    unsigned int chigh_copy = chigh;
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
    register unsigned int cstep = delta_plus_one / s->scale;

    chigh_copy = clow_copy + (cstep * s->high_count) - 1;
    clow_copy = clow_copy + (cstep * s->low_count);

    // e3 scaling is performed for speed and to avoid underflow
    // if low and high are both in the lower half or both in the upper half,
    // one bit can safely be shifted out
    while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
        if (chigh_copy < CODER_LIMIT050) {  // both high and low are below the midpoint, so 0 can safely be shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow_copy &= CODER_LIMIT050 - 1;
            chigh_copy &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits

            write_nrbits_as_zeros();
        }
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
        ++nrbits;
        clow_copy &= CODER_LIMIT025 - 1;
        chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }
    clow = clow_copy;
    chigh = chigh_copy;
}

Here are the updated VTune results for this version. The new version includes the following changes:

  • Avoid one branch by using & instead of && in the last while loop (that trick did not help in the first loop).
  • Copy the class fields to local variables.

The following suggestions unfortunately did not improve performance:

  • Replacing the first while loop with a switch with goto statements.
  • Using fixed point arithmetic for division (it created rounding errors).
  • Doing a switch on s->scale and doing bit-shifts instead of division for even powers of 2.

@example suggested that it's not the division that's slow but the memory access for one of the operands of the division. That seems to be correct. According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

Yrlec
  • A common alternative to avoid division is to multiply with `1/x`. Some compilers do this automatically (at least, I've seen gcc, Borland's, and VC do that for integers) -- but I'm wondering whether in this day and age the gain is as large as it used to be. The first `jnb` seems to use an inappropriate amount of time as well. Would reordering the conditions work here? – Jongware Apr 10 '14 at 15:16
  • I was looking at the generated code and noticed several references to `dword ptr [esi+0x10]` which leads me to believe this was not compiled with optimization enabled? With optimization I expect the compiler to load that into a register. – amdn Apr 10 '14 at 15:22
  • I agree that it looks odd but it was compiled with full optimization (/Ox). I've added the full compiler settings to the question now. – Yrlec Apr 10 '14 at 15:27
  • Ah -- all info is there, *except* the compiler you used :) – Jongware Apr 10 '14 at 15:27
  • :) I've added that now as well. – Yrlec Apr 10 '14 at 15:29
  • @Jongware For integer `x`, `1/x` exists (modulo 2^32) only if `x` is a unit, i.e. if `x` is odd. On top of that, the optimization is only done if `x` is a compile-time constant. Bit-shift optimization is also possible, but only if the operands are unsigned. – KevinZ Apr 10 '14 at 16:34
  • This article is about lz4 decode rather than arithmetic encode but it might give you some ideas, it's a good read anyway: http://cbloomrants.blogspot.ca/2013/10/10-14-13-oodle-fast-lz4.html – mattnewport Apr 10 '14 at 18:53
  • It is possible to perform fast division using fixed-point multiplication. See my answer for more details. – alexander Apr 10 '14 at 19:10
  • In the assembly output it says that storing the result in memory is what is taking time on that code line, not the actual division, or am I mistaken? Probably caused by page faults. Maybe you can change the memory layout to fix this. – example Apr 10 '14 at 20:39
  • You could try to read all the necessary class variables into local variables at the beginning of the function and store the modified variables at the end. – alexander Apr 10 '14 at 21:07
  • Great minds think alike! I'm just in the process of doing that. Will update with the results. – Yrlec Apr 10 '14 at 21:14
  • I don't know the domain very well, but you could possibly minimize the memory-access impact if symbols are contiguous in memory, their data members are ordered appropriately, and the alignment of each is set so that fetches don't have to be done in sequence. You've probably already done this, and I expect the gain would be minimal, but I put it out there just in case. – qeadz Apr 14 '14 at 18:28
  • What is the range of both your divisor and dividend? Can either get very large? – Apriori Apr 14 '14 at 19:43
  • @Apriori: delta_plus_one can be pretty much any 32 bit int. I have never seen s->scale get larger than 3000. – Yrlec Apr 14 '14 at 19:52
  • So much for look-up tables then. If the division is slow because of the memory access to the divisor and not the division itself, you could do a couple of things. 1) You could try moving the divisor into a value that will be stored in a register, so that the register-operand division is generated rather than the one that operates on memory. Then you may be able to see which part is slow from VTune more easily, though it's still hard to say. Maybe a better way would be to just replace the division by a multiplication to see if it's still slow, even though the results will be incorrect. – Apriori Apr 14 '14 at 23:30
  • 2) If it is slow because of the memory read: where does the object that `s` points to come from? Are all the objects that `s` ever points to allocated in contiguous memory and passed to encode in the order they appear in the buffer? If not, can you make it so? If this function is called repeatedly over such a buffer, this should help your memory-read situation, as most of the time the value will then be in the cache. – Apriori Apr 14 '14 at 23:31

3 Answers


According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

The way we organize data directly impacts performance, because data locality determines how the cache behaves. To exploit this, a program should access memory as linearly as possible and avoid indirect reads and writes (pointer-based data structures). Linear access keeps the probability high that the data is already in the L1 cache.

Looking at your code and the VTune report, it appears that the most important data is the argument passed to this function; several data members of this object are read within it.

void aricoder::encode( symbol* s )

The function accesses the following data members of this object:

s->scale
s->high_count
s->low_count

From both VTune reports, we can see that the three memory accesses have different timings. This indicates that the members sit at different offsets within the object. Accessing one of them (s->high_count) goes beyond the L1 cache and therefore takes more time, since the data has to be brought into the cache first; s->low_count then benefits because it is now in L1. From these data I can suggest the following:

  1. Put your most frequently accessed data members in the hot zone of the object, i.e. at the top. That gives a better chance that they fit into the object's first cache line. So try to reorganize the object's memory layout according to how its data members are accessed. I assume the object has no virtual table, as vtables are not cache-friendly.

  2. It is possible that your overall program is organized in such a way that around this point (i.e. during the execution of this function) the L1 cache is full, so the program has to fetch from L2, and that transition costs extra CPU cycles (the spike). In this scenario I do not think we can do much: this is a limitation of the machine, and in some sense we are already pushing against fairly low-level boundaries.

  3. Your object s seems to be a POD type, so access to it is linear. This is good, and there is no room for improvement there. However, the way it is allocated can affect the cache: if it is allocated anew every time, that can have an impact while executing the current function.
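To illustrate point 1, here is a minimal layout sketch. The member names scale, low_count and high_count come from the question; everything else (the exact types, the cold members) is assumed for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout sketch: the three members that encode() reads on
// every call are packed together at the top of the struct, so they are
// likely to land in the same cache line.
struct symbol {
    uint32_t scale;       // hot: read on every encode() call
    uint32_t low_count;   // hot
    uint32_t high_count;  // hot
    // ... colder, rarely accessed members would follow here ...
};

// Sanity check: the hot members fit well inside one 64-byte cache line.
static_assert(offsetof(symbol, high_count) + sizeof(uint32_t) <= 64,
              "hot members should fit in the first cache line");
```

Whether this helps in practice depends on how the real symbol objects are laid out and allocated, which the question does not show.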

Apart from that, I think we should also refer to the following SO posts, which discuss these concepts (data cache / instruction cache) in great detail and link to in-depth analysis and further information:

What is "cache-friendly" code?

How to write instruction cache friendly program in c++?

I suggest you read these posts. They are really helpful for understanding the internals of these concepts, even if they do not directly help you optimize this particular piece of code. Maybe your program is already well optimized and there is very little we can do here. :)

Mantosh Kumar

This is not a complete answer. The code below demonstrates the use of fixed-point arithmetic to perform fast integer division; the technique is widely used in DSP and signal processing. Note that the code only makes sense as an optimization if 'scale' changes infrequently. Also, for small values of 'scale', the code could be rewritten to use uint32_t as the intermediate result.

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
   uint32_t scale;
   uint32_t scale_inv;
   uint32_t delta_plus_one;
   uint32_t val0, val1;
   uint64_t tmp;

   scale = 5;
   delta_plus_one = 44533;

   /* Place the line in 'scale' setter function */
   scale_inv = 0x80000000 / scale;

   /* Original expression */
   val0 = (delta_plus_one / scale);

   /* Division using multiplication uint64_t by uint32_t,
      using uint64_t as intermediate result */
   tmp = (uint64_t)(delta_plus_one) * scale_inv;
   /* shift right to produce result */
   val1 = tmp >> 31;

   printf("val0 = %u; val1 = %u\n", val0, val1);
   return 0;
}
alexander
  • Great idea but I can't get it to work. Some results are the same as before but some of them are off by one. E.g. delta_plus_one = 993602304 and s->scale = 25 – Yrlec Apr 10 '14 at 19:46
  • Generally, when dealing with fixed point, you need to be prepared for precision loss and overflow. If these errors have a significant impact on the algorithm, then fixed point is not suitable for it. – alexander Apr 10 '14 at 20:12
  • Well, this arithmetic coder is supposed to be lossless, so I guess it's not an option then. – Yrlec Apr 10 '14 at 20:16
  • You could try 'scale_inv=0xffffffff/scale' or 'scale_inv=(uint64_t)0x100000000/scale' and shift 'val1=tmp>>32;' – alexander Apr 10 '14 at 20:21
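For what it's worth, the off-by-one results discussed in these comments can be avoided by following the reciprocal multiply with a cheap correction step. This is a sketch of that idea, not PackJPG code; the names FastDiv, make_fastdiv and fast_div are made up for illustration, and it assumes the divisor changes rarely enough that precomputing the reciprocal pays off:

```cpp
#include <cstdint>

// Precomputed reciprocal for an invariant divisor d; recompute only when
// d changes (e.g. in the 'scale' setter).
struct FastDiv {
    uint32_t d;
    uint64_t inv;  // floor(2^32 / d)
};

FastDiv make_fastdiv(uint32_t d) {
    return FastDiv{ d, (uint64_t(1) << 32) / d };
}

// Exact n / d. Because inv <= 2^32 / d, the estimate q never exceeds the
// true quotient; it can be slightly too small, and the loop corrects that
// in at most a couple of iterations.
uint32_t fast_div(uint32_t n, const FastDiv& f) {
    uint32_t q = uint32_t((uint64_t(n) * f.inv) >> 32);
    while ((uint64_t(q) + 1) * f.d <= n) ++q;
    return q;
}
```

With the problematic values from the comment above (delta_plus_one = 993602304, scale = 25) the corrected result matches the plain division exactly. Whether the multiply-plus-fixup actually beats a hardware div on a given CPU is something only profiling can tell.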

To start off, CODER_LIMIT050 is a stupid name, made especially stupid by the coexistence of CODER_LIMIT025 and CODER_LIMIT075. Other than that, you probably don't want to use short-circuit logic when there are no side effects anyway, so the second while statement can be:

while ( ( clow >= CODER_LIMIT025 ) & ( chigh < CODER_LIMIT075 ) )

The first while block can be further optimized to collapse the 3 possible branching statements per iteration into one:

start:
switch ( ( clow >= CODER_LIMIT050 ) | ( ( chigh < CODER_LIMIT050 ) << 1 ) )
{
default: break;

case 2: // chigh < CODER_LIMIT050: both are below the midpoint
    write_zero ( );
    write_nrbits_as_one ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;

case 3: // think about this case, is this what you want?
case 1: // clow >= CODER_LIMIT050: both are above the midpoint
    write_one ( );
    clow &= CODER_LIMIT050 - 1;
    chigh &= CODER_LIMIT050 - 1;
    write_nrbits_as_zeros ( );
    clow <<= 1;
    chigh = ( chigh << 1 ) | 1;
    goto start;
}

If you want to optimize away the division by s->scale, ask yourself just how variable it really is. If there are only a few possible values, then template it out. Once the divisor is a compile-time constant, the compiler can either reduce it to a bit shift (for powers of two) or find its multiplicative inverse modulo 2^32, which exists whenever the divisor is odd.
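A sketch of that templating idea, assuming (hypothetically) that only a handful of scale values ever occur in practice; the dispatcher div_by_scale and the particular case values are made up for illustration:

```cpp
#include <cstdint>

// With Scale a compile-time constant, the compiler can replace the
// division with a shift (powers of two) or a multiply-by-reciprocal
// sequence instead of a hardware div.
template <uint32_t Scale>
uint32_t div_by(uint32_t x) {
    return x / Scale;
}

// Hypothetical dispatcher for a known, small set of scale values; the
// generic division remains as the fallback for everything else.
uint32_t div_by_scale(uint32_t x, uint32_t scale) {
    switch (scale) {
        case 2:   return div_by<2>(x);
        case 3:   return div_by<3>(x);
        case 256: return div_by<256>(x);
        default:  return x / scale;  // rare scales still pay for the div
    }
}
```

Whether the dispatch branch itself ends up cheaper than the division depends on how predictable the actual scale values are; the question's observation that a switch made the first loop slower suggests profiling each variant is essential.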

KevinZ
  • @amdn Executing the comparison is cheaper than the branch. If you are going for performance, always try to have 0 side effect comparisons which would allow you to use `&` and `|` over `&&` and `||`. – KevinZ Apr 10 '14 at 16:39
  • Not sure the compiler will execute the comparison without a branch, but it is possible. – amdn Apr 10 '14 at 16:49
  • @amdn The comparison function itself does not branch. The statements that can cause branches include `&&`, `||`, `?:`, `if`, `else if`, `switch`, `while`, `do while`, and the middle statement of `for`. – KevinZ Apr 10 '14 at 17:00
  • extern int foo(); extern int bar(); bool flag = foo() > bar(); // compiler either generates a compare and branch or for x86 perhaps conditional move and subtract (which might be slower than compare and branch), if tricky it might subtract and extract overflow flag, but I doubt it. – amdn Apr 10 '14 at 17:17
  • I just tested it, on x86 gcc generates `cmp` followed by `setl`, forgot about `setl`... I guess it depends on the target machine whether a branch is necessary. – amdn Apr 10 '14 at 17:27
  • Great ideas! The second while-loop became faster after skipping the short-circuiting. Unfortunately the switch made the first loop slower, which was a bit surprising. – Yrlec Apr 10 '14 at 17:49