I'm in the process of optimizing the encoding step of a C++ library called PackJPG
I've profiled the code with Intel VTune and found that the current bottleneck is the following function in the arithmetic coder that PackJPG uses:
void aricoder::encode( symbol* s )
{
// update steps, low count, high count
unsigned int delta_plus_one = ((chigh - clow) + 1);
cstep = delta_plus_one / s->scale;
chigh = clow + ( cstep * s->high_count ) - 1;
clow = clow + ( cstep * s->low_count );
// e3 scaling is performed for speed and to avoid underflows
// if both, low and high are either in the lower half or in the higher half
// one bit can be safely shifted out
while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
if ( chigh < CODER_LIMIT050 ) { // this means both, high and low are below, and 0 can be safely shifted out
// write 0 bit
write_zero();
// shift out remaing e3 bits
write_nrbits_as_one();
}
else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
// write 1 bit
write_one();
clow &= CODER_LIMIT050 - 1;
chigh &= CODER_LIMIT050 - 1;
// shift out remaing e3 bits
write_nrbits_as_zeros();
}
clow <<= 1;
chigh = (chigh << 1) | 1;
}
// e3 scaling, to make sure that theres enough space between low and high
while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
++nrbits;
clow &= CODER_LIMIT025 - 1;
chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
// clow -= CODER_LIMIT025;
// chigh -= CODER_LIMIT025;
clow <<= 1;
chigh = (chigh << 1) | 1;
}
}
This function seems to borrow some ideas from: http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf. I've managed to optimize the function somewhat (primarily by speeding up the bit writing) but now I'm stuck.
Right now the biggest bottleneck seems to be division at the beginning. This screenshot from VTune shows the time it takes results as well as the assembly created (the blue assembly to the right corresponds to the line in the source code selected to the left).
s->scale is not necessarily an even power of 2 so the division can't be replaced with a modulo operation.
The code was compiled with MSVC (from Visual Studio 2013) with the following settings:
/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch"
Any ideas on how to optimize this further?
UPDATE 1 I've now tried all suggestions so far and this is the fastest version now:
void aricoder::encode( symbol* s )
{
unsigned int clow_copy = clow;
unsigned int chigh_copy = chigh;
// update steps, low count, high count
unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
unsigned register int cstep = delta_plus_one / s->scale;
chigh_copy = clow_copy + (cstep * s->high_count) - 1;
clow_copy = clow_copy + (cstep * s->low_count);
// e3 scaling is performed for speed and to avoid underflows
// if both, low and high are either in the lower half or in the higher half
// one bit can be safely shifted out
while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
if (chigh_copy < CODER_LIMIT050) { // this means both, high and low are below, and 0 can be safely shifted out
// write 0 bit
write_zero();
// shift out remaing e3 bits
write_nrbits_as_one();
}
else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
// write 1 bit
write_one();
clow_copy &= CODER_LIMIT050 - 1;
chigh_copy &= CODER_LIMIT050 - 1;
// shift out remaing e3 bits
write_nrbits_as_zeros();
}
clow_copy <<= 1;
chigh_copy = (chigh_copy << 1) | 1;
}
// e3 scaling, to make sure that theres enough space between low and high
while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
++nrbits;
clow_copy &= CODER_LIMIT025 - 1;
chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
// clow -= CODER_LIMIT025;
// chigh -= CODER_LIMIT025;
clow_copy <<= 1;
chigh_copy = (chigh_copy << 1) | 1;
}
clow = clow_copy;
chigh = chigh_copy;
}
Here's the updated VTune results with this version:
This new version includes the following changes:
- Avoid one branch by using & instead of && in the last while loop (that trick did not help in the first loop).
- Copy the class fields to local variables.
The following suggestions unfortunately did not improve performance:
- Replacing the first while loop with a switch with goto statements.
- Using fixed point arithmetic for division (it created rounding errors).
- Doing a switch on s->scale and doing bit-shifts instead of division for even powers of 2.
@example suggested that it's not the division that's slow but the memory access for one of the operands of the division. That seems to be correct. According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?