Your final code looks weird. Why shuffle and then do a bytewise shift of the entire register? Instead, set up your shuffle control mask to put things in the right place to start with.
Also, packusdw doesn't convert full-range 32bit to full-range 16bit. It saturates: any 32bit element greater than 2^16-1 becomes 0xffff (and, since the input is treated as signed, any negative element becomes 0). So you have to right-shift the data yourself to go from 24bit full range to 16bit full range. (In audio, the conversion from 16 to 24 bits is done by adding 8 zero bits as least-significant bits, not most-significant.)
Anyway, the implication of this is that we want the high 16b of every 24bit input sample, packed back-to-back. We can do this with just a shuffle.
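As a reference for what the shuffle has to compute, here is a minimal scalar sketch. It assumes packed little-endian 24bit samples, which is what the shuffle masks in this answer imply. The function name `convert24to16_scalar` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: keep the high 16 bits of each packed 24-bit sample.
 * Assumes little-endian layout: byte 0 of each sample is the
 * least-significant byte (dropped); bytes 1 and 2 form the output. */
static void convert24to16_scalar(const uint8_t *in, uint8_t *out, size_t n_samples)
{
    for (size_t i = 0; i < n_samples; i++) {
        out[2 * i]     = in[3 * i + 1];  /* middle byte -> low byte of output  */
        out[2 * i + 1] = in[3 * i + 2];  /* top byte    -> high byte of output */
    }
}
```

This is also a handy baseline for checking a vectorized version against.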
//__m128i shuffleMask = _mm_setr_epi8(-1,0,1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11);
// setr takes its args in memory order (lowest byte first), so right-shift by 2 bytes -> drop the first 2 args and pad the end with -1
//__m128i shiftedMask = _mm_setr_epi8(1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11,-1,-1);
// could get 10B, but packing that into the output would be slower
__m128i mask_lo = _mm_setr_epi8( 1,2, 4,5, 7,8, 10,11,
-1,-1, -1,-1, -1,-1, -1,-1);
// __m128i mask_hi = _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1,
// 1,2, 4,5, 7,8, 10,11);
// generate this from mask_lo instead of using more storage space
... pointer setup
movdqu xmm3, xmmword ptr [mask_lo]
pshufd xmm4, xmm3, 0x4E // swap high/low halves
convertloop:
movdqu xmm0, [eax] // read 4 samples
pshufb xmm0, xmm3 // low 8B = 24->16 of first 12B, high8 = 0
movdqu xmm1, [eax + 12] // read next 4 samples
pshufb xmm1, xmm4 // high 8B = 2nd chunk of audio, low8 = 0
por xmm1, xmm0 // merge the two halves
movdqu [edi], xmm1 // write 8 samples
add eax, 24
lea edi, [edi + 16]
sub ecx, 24
jg convertloop
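To see what the two masks do, here is a scalar model of pshufb and of one loop iteration. It is an illustration only, not the SSSE3 instruction itself, and `pshufb_model` and `convert8_model` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of pshufb: a control byte with its high bit set (e.g. -1)
 * produces 0; otherwise its low 4 bits select a byte of src. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16], const int8_t ctrl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] < 0) ? 0 : src[ctrl[i] & 0x0F];
    memcpy(dst, tmp, 16);
}

static const int8_t mask_lo[16] = { 1,2, 4,5, 7,8, 10,11,
                                    -1,-1, -1,-1, -1,-1, -1,-1 };
static const int8_t mask_hi[16] = { -1,-1, -1,-1, -1,-1, -1,-1,
                                    1,2, 4,5, 7,8, 10,11 };

/* One loop iteration: 8 samples in, 16 bytes out. `in` must have at least
 * 28 readable bytes, because the second 16-byte load starts at offset 12. */
static void convert8_model(const uint8_t *in, uint8_t out[16])
{
    uint8_t lo[16], hi[16];
    pshufb_model(lo, in, mask_lo);       /* low 8B = samples 0..3, high 8B = 0 */
    pshufb_model(hi, in + 12, mask_hi);  /* high 8B = samples 4..7, low 8B = 0 */
    for (int i = 0; i < 16; i++)
        out[i] = lo[i] | hi[i];          /* por: merge the two halves */
}
```

Because each shuffle zeroes the half it doesn't fill, a plain OR is enough to combine the two results.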
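Also, be careful about reading past the end of the array. Each movdqu reads 16B, but you only use the first 12, so the second load of the last iteration reads 4 bytes beyond the 24 bytes that iteration consumes.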
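One way to stay in bounds is to stop the vector loop while 28 input bytes are still readable and finish with scalar code. The following is a sketch in C intrinsics under that assumption, not a tuned implementation. The function name `convert24to16` is hypothetical, and the SSSE3 path is only compiled when the compiler defines `__SSSE3__` (e.g. with `-mssse3` on GCC/clang); otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>   /* _mm_shuffle_epi8 (pshufb) */
#endif

/* Convert n_samples packed little-endian 24-bit samples to 16-bit by
 * truncation. Never reads past in[3 * n_samples - 1]. */
static void convert24to16(const uint8_t *in, uint8_t *out, size_t n_samples)
{
    size_t in_bytes = 3 * n_samples;
    size_t i = 0;   /* index of the next sample to convert */
#if defined(__SSSE3__)
    const __m128i mask_lo = _mm_setr_epi8(1,2, 4,5, 7,8, 10,11,
                                          -1,-1, -1,-1, -1,-1, -1,-1);
    const __m128i mask_hi = _mm_shuffle_epi32(mask_lo, 0x4E); /* swap halves */
    /* Each iteration loads 16B at byte offsets 3*i and 3*i + 12, so it
     * touches bytes up to 3*i + 27: require 28 readable bytes. */
    while (3 * i + 28 <= in_bytes) {
        __m128i a = _mm_loadu_si128((const __m128i *)(in + 3 * i));
        __m128i b = _mm_loadu_si128((const __m128i *)(in + 3 * i + 12));
        a = _mm_shuffle_epi8(a, mask_lo);   /* samples i..i+3 -> low 8B    */
        b = _mm_shuffle_epi8(b, mask_hi);   /* samples i+4..i+7 -> high 8B */
        _mm_storeu_si128((__m128i *)(out + 2 * i), _mm_or_si128(a, b));
        i += 8;
    }
#endif
    /* Scalar tail (or the whole array when SSSE3 is not enabled). */
    for (; i < n_samples; i++) {
        out[2 * i]     = in[3 * i + 1];
        out[2 * i + 1] = in[3 * i + 2];
    }
}
```

The scalar tail handles at most 9 samples, so its cost is negligible for buffers of any real size.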
I could have used the same mask twice, and used PUNPCKLQDQ to put the high 8B into the top half of the reg holding the low 8B. However, punpck instructions compete for the same port as pshufb (ports 1, 5 on Nehalem/Sandybridge/IvyBridge, port 5 only on Haswell). por can run on any of ports 0,1,5, even on Haswell, so it doesn't create a port5 bottleneck problem.
Loop overhead is too high without unrolling to saturate port5 even on Haswell, but it's close. (9 fused-domain uops, 2 of them requiring port5. There's no loop-carried dependency, and enough of the uops are loads/stores that 4uops per cycle should be possible.) Unrolling by 2 or 3 should do the trick. Nehalem/Sandybridge/Ivybridge won't bottleneck on execution ports, since they can shuffle on two ports. Core2 takes 4 uops for PSHUFB, and can only sustain 1 per 2 cycles, but it's still the fastest way to do this data movement. Penryn (aka Wolfdale) should be fast for this too, but I haven't looked at details. Decoder throughput will be an issue on pre-Nehalem, though.
So if everything's in L1 cache, we can generate 16B of 16b audio per 2 cycles. (Or less, with some unrolling, on pre-Haswell.)
AMD CPUs (e.g. Steamroller) also have pshufb on the same port as punpck, while booleans can run on either of the other 2 vector ports, so it's the same situation. Shuffles are higher latency than on Intel, but throughput is still 1 per cycle.
If you want proper rounding instead of truncation, add something like 2^7 (half of the 2^8 you're dividing by) to the samples before truncation. (Probably requiring some sign-adjustment, and saturation so the largest samples don't wrap.) If you want dithering, you need something even more complex, and should search for how that's done, or look for a library implementation. Audacity is open source, so you could look at how they do it.
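As a rough scalar sketch of the rounding idea, assuming each 24bit sample has already been sign-extended into an `int32_t`: add 2^7, drop the low 8 bits, and saturate. The name `round24to16` is made up, and ties round toward positive infinity here, which may or may not be the rounding mode you want.

```c
#include <stdint.h>

/* Round a signed 24-bit sample (held in an int32_t, range -2^23 .. 2^23-1)
 * to 16 bits: add half an output LSB (2^7), drop the low 8 bits, saturate. */
static int16_t round24to16(int32_t s24)
{
    /* Bias by 2^23 so the value is non-negative before shifting;
     * right-shifting a negative signed int is implementation-defined in C. */
    uint32_t biased = (uint32_t)(s24 + (1 << 23) + (1 << 7));
    int32_t r = (int32_t)(biased >> 8) - 32768;  /* undo the bias (2^23 >> 8) */
    if (r > 32767)
        r = 32767;   /* the largest inputs would otherwise round up to 32768 */
    return (int16_t)r;
}
```

A vectorized version would need to widen the samples to do the add, so it no longer fits the single-shuffle approach.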