
Is there a fast way to quantize an array of audio samples from 24 bit to 16 bit (using intrinsics or asm)?

The source format is signed 24-bit little-endian.

Update: I managed to get the conversion working as described:

static void __cdecl Convert24bitToStereo16_SSE2(uint8_t* src, uint8_t* dst, int len)
{
    __m128i shuffleMask = _mm_setr_epi8(-1,0,1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11);             

    __asm 
  {    
        mov        eax, [src]   // src          
        mov        edi, [dst]   // dst
        mov        ecx, [len]   // len

        movdqu     xmm0,xmmword ptr [shuffleMask]           

      convertloop:
        movdqu     xmm1, [eax]              // read 4 samples           
        lea        eax,  [eax + 12]         // inc pointer                      
        pshufb     xmm1,xmm0                // shuffle using mask
        psrldq     xmm1, 2                  // shift right

        movdqu     xmm2, [eax]              // read next 4 samples          
        lea        eax,  [eax + 12]         // inc pointer                      
        pshufb     xmm2, xmm0               // shuffle
        psrldq     xmm2, 2                  // shift right
        packusdw   xmm1, xmm2               // pack upper and lower samples

        movdqu     [edi], xmm1              // write 8 samples
        lea        edi, [edi + 16]
        sub        ecx, 24
        jg         convertloop
  }
}

Now for the dithering: how do I avoid quantization artifacts?

Any hint is welcome. Thanks.

ohrfritz
  • 24 to 16 bits is pretty straightforward - you load three 128 bit values, then you shuffle (`_mm_shuffle_epi8`) bytes to drop each third byte, and eventually store two 128 bit values as the result. Slightly more complicated if you need accurate rounding. – Roman R. May 02 '15 at 21:47
  • @RomanR. I don't think that is going to take care of the dithering problem. – Brad May 04 '15 at 00:09
  • What kind of dithering do you want to apply ? – Paul R May 04 '15 at 15:36

1 Answer

Your final code looks weird. Why shuffle and then do a bytewise shift of the entire register? Instead, set up your shuffle control mask to put things in the right place to start with.

Also, packusdw doesn't convert full-range 32bit to full-range 16bit. It treats its input as signed and saturates: any 32bit element greater than 2^16-1 becomes 0xffff, and any negative element becomes 0. So you have to right-shift the data yourself to go from 24bit full range to 16bit full range. (In audio, the conversion from 16 to 24 bits is done by adding 8 zero bits as least-significant bits, not most-significant.)
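To make the saturation point concrete, here is a minimal scalar sketch of what a single packusdw lane does, next to the shift that actually keeps the high 16 bits. The helper names `packusdw_lane` and `high16_of_24` are made up for illustration and are not part of any library.

```c
#include <stdint.h>

/* Scalar model of one packusdw lane: signed 32-bit in, unsigned 16-bit out,
   clamped to the range [0, 0xFFFF]. */
static uint16_t packusdw_lane(int32_t x)
{
    if (x < 0) return 0;
    if (x > 0xFFFF) return 0xFFFF;
    return (uint16_t)x;
}

/* Keep the high 16 bits of a 24-bit two's-complement bit pattern instead
   (plain truncation, no rounding). */
static uint16_t high16_of_24(uint32_t x24)
{
    return (uint16_t)((x24 & 0xFFFFFFu) >> 8);
}
```

For example, the 24-bit pattern 0x123456 saturates to 0xFFFF in the packusdw model, while the shift gives 0x1234.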

Anyway, the implication of this is that we want to pack the high 16b of every 24bits of input back-to-back. We can just do this with a shuffle.
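As a reference for checking any SIMD version, the same data movement can be sketched in plain C: for little-endian input, the high 16 bits of each sample are simply bytes 1 and 2 of every 3-byte group. The function name `convert_s24le_to_s16le_scalar` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Truncating 24 -> 16 bit conversion: drop the least-significant byte of each
   little-endian 3-byte sample and keep the other two, back to back. */
void convert_s24le_to_s16le_scalar(const uint8_t *src, uint8_t *dst,
                                   size_t nsamples)
{
    for (size_t i = 0; i < nsamples; ++i) {
        dst[2 * i]     = src[3 * i + 1];  /* middle byte -> low byte of output */
        dst[2 * i + 1] = src[3 * i + 2];  /* top byte    -> high byte of output */
    }
}
```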

//__m128i shuffleMask = _mm_setr_epi8(-1,0,1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11);
// setr takes its args lowest byte first (the reverse of _mm_set_epi8), so right-shift by 2 bytes -> drop the first 2 args
//__m128i shiftedMask = _mm_setr_epi8(1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11,-1,-1);

// could get 10B, but packing that into the output would be slower
__m128i mask_lo = _mm_setr_epi8( 1,2,  4,5,   7,8,   10,11,
                                -1,-1, -1,-1, -1,-1, -1,-1);
//    __m128i mask_hi = _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1,
//                                     1,2,  4,5,   7,8,   10,11);
//  generate this from mask_lo instead of using more storage space  

  ... pointer setup
  movdqu     xmm3, xmmword ptr [mask_lo]
  pshufd     xmm4, xmm3, 0x4E  // swap high/low halves

  convertloop:
    movdqu     xmm0, [eax]              // read 4 samples
    pshufb     xmm0, xmm3               // low 8B = 24->16 of first 12B, high8 = 0
    movdqu     xmm1, [eax + 12]         // read next 4 samples
    pshufb     xmm1, xmm4               // high 8B = 2nd chunk of audio, low8 = 0
    por        xmm1, xmm0               // merge the two halves

    movdqu     [edi], xmm1              // write 8 samples
    add        eax, 24
    lea        edi, [edi + 16]
    sub        ecx, 24
    jg         convertloop

Also, be careful about reading past the end of the array. Each movdqu reads 16B, but you only use the first 12.
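One way to handle the over-read is sketched below with intrinsics: the vector loop only runs while at least 10 samples (30 bytes) remain, since the second 16B load of an iteration touches bytes up to offset 27, and a scalar loop finishes the tail. This is an illustration under assumptions, not a tuned implementation: the function name is hypothetical, and the SIMD path is only compiled when the compiler defines `__SSSE3__` (e.g. with `-mssse3` on GCC/Clang), otherwise the scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8 (pshufb) */
#endif

/* Truncating signed 24-bit LE -> signed 16-bit LE conversion.
   src holds 3*nsamples bytes, dst receives 2*nsamples bytes. */
void convert_s24le_to_s16le(const uint8_t *src, uint8_t *dst, size_t nsamples)
{
    size_t i = 0;

#if defined(__SSSE3__)
    /* Same masks as in the asm version: mask_lo packs 4 samples into the low
       8 bytes, mask_hi (halves swapped) packs 4 samples into the high 8 bytes. */
    const __m128i mask_lo = _mm_setr_epi8( 1,  2,  4,  5,  7,  8, 10, 11,
                                          -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i mask_hi = _mm_shuffle_epi32(mask_lo, 0x4E);

    /* Each iteration reads bytes [0, 28) relative to the current sample, so
       require 10 remaining samples (30 bytes) to stay inside the buffer. */
    while (nsamples - i >= 10) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 3 * i + 12));
        a = _mm_shuffle_epi8(a, mask_lo);   /* low 8B  = samples i..i+3   */
        b = _mm_shuffle_epi8(b, mask_hi);   /* high 8B = samples i+4..i+7 */
        _mm_storeu_si128((__m128i *)(dst + 2 * i), _mm_or_si128(a, b));
        i += 8;
    }
#endif

    /* Scalar tail (or the whole buffer when SSSE3 is not enabled). */
    for (; i < nsamples; ++i) {
        dst[2 * i]     = src[3 * i + 1];
        dst[2 * i + 1] = src[3 * i + 2];
    }
}
```

Letting the compiler schedule the loop also leaves the unrolling decision to it, at the cost of less control than hand-written asm.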

I could have used the same mask twice, and used PUNPCKLQDQ to put the high 8B into the top half of the reg holding the low 8B. However, punpck instructions compete for the same port as pshufb. (ports 1, 5 on Nehalem/Sandybridge/IvyBridge, port 5 only on Haswell.) por can run on any of ports 0,1,5, even on Haswell, so it doesn't create a port5 bottleneck problem.

Loop overhead is too high without unrolling to saturate port5 even on Haswell, but it's close. (9 fused-domain uops, 2 of them requiring port5. There's no loop-carried dependency, and enough of the uops are loads/stores that 4 uops per cycle should be possible.) Unrolling by 2 or 3 should do the trick. Nehalem/Sandybridge/Ivybridge won't bottleneck on execution ports, since they can shuffle on two ports. Core2 takes 4 uops for PSHUFB, and can only sustain 1 per 2 cycles, but it's still the fastest way to do this data movement. Penryn (aka Wolfdale) should be fast for this too, but I haven't looked at details. Decoder throughput will be an issue on pre-Nehalem, though.

So if everything's in L1 cache, we can generate 16B of 16b audio per 2 cycles. (Or less, with some unrolling, on pre-Haswell.)

AMD CPUs (e.g. Steamroller) also have pshufb on the same port as punpck, while booleans can run on either of the other 2 vector ports, so it's the same situation. Shuffles are higher latency than on Intel, but throughput is still 1 per cycle.

If you want proper rounding instead of truncation, add something like 2^7 (half of the 2^8 step you are discarding) to the samples before truncation. (This probably requires some sign-adjustment, and saturation so the largest positive samples don't wrap.) If you want dithering, you need something even more complex, and should search for an existing description or a library implementation. Audacity is open source, so you could look at how they do it.
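A scalar sketch of the rounding idea, plus one common dither choice, might look like the following. Everything here is an assumption for illustration: the function names are made up, the sign-adjustment is done by biasing into an unsigned range, and the dither is plain TPDF (triangular) noise of about ±1 output LSB from a small xorshift generator. It is not taken from Audacity or any other library, and it does no noise shaping.

```c
#include <stdint.h>

/* Load one signed 24-bit little-endian sample, sign-extended to 32 bits. */
int32_t load_s24le(const uint8_t *p)
{
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (int32_t)(u ^ 0x800000u) - 0x800000;
}

/* Round to nearest (half up) with saturation. s must be in
   [-0x800000, 0x7FFFFF]. Biasing to unsigned avoids shifting negatives. */
int16_t round_24_to_16(int32_t s)
{
    uint32_t biased = (uint32_t)(s + 0x800000);   /* 0 .. 0xFFFFFF */
    uint32_t q = (biased + 0x80u) >> 8;           /* add 2^7, then drop 8 bits */
    if (q > 0xFFFFu) q = 0xFFFFu;                 /* saturate the top end */
    return (int16_t)((int32_t)q - 0x8000);
}

/* Small xorshift32 PRNG; *state must be non-zero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* TPDF dither: the difference of two uniform values in [0, 255] is
   triangular noise in [-255, 255], i.e. roughly +/-1 LSB at 16 bits. */
int16_t dither_24_to_16(int32_t s, uint32_t *rng)
{
    int32_t r1 = (int32_t)(xorshift32(rng) & 0xFFu);
    int32_t r2 = (int32_t)(xorshift32(rng) & 0xFFu);
    int32_t v = s + (r1 - r2);
    if (v > 0x7FFFFF) v = 0x7FFFFF;               /* keep v a valid 24-bit value */
    if (v < -0x800000) v = -0x800000;
    return round_24_to_16(v);
}
```

With this scheme the dithered output differs from the plainly rounded output by at most one 16-bit step. A vectorized version would need a fast per-sample random source, which is a separate design problem.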

Peter Cordes