
I need the fastest way to convert a byte array into a short array of audio data.

The byte array contains data from two audio channels, interleaved as follows:

C1C1C2C2 C1C1C2C2 C1C1C2C2 ...

where

C1C1 - two bytes of the first channel

C2C2 - two bytes of the second channel

Currently I use the following algorithm, but I feel there is a better way to perform this task.

byte[] rawData = //from audio device
short[] shorts = new short[rawData.Length / 2];
short[] channel1 = new short[rawData.Length / 4];
short[] channel2 = new short[rawData.Length / 4];
System.Buffer.BlockCopy(rawData, 0, shorts, 0, rawData.Length);
for (int i = 0, j = 0; i < shorts.Length; i+=2, ++j)
{
    channel1[j] = shorts[i];
    channel2[j] = shorts[i+1];
}
anth
  • Given the interleaved nature of the data, that looks fine to me. You could save some memory copying by writing a wrapper over `shorts` to give a virtual `channel1` and `channel2`, but memory copies are fast (unless you have a lot of data, then look at the wrapping option; a sketch of such a wrapper follows these comments). – Tim Lloyd Oct 13 '11 at 09:32
  • I'll add that if you like to live dangerously you can use the `struct` trick from here: http://stackoverflow.com/q/621493/613130 to skip the BlockCopy. – xanatos Oct 13 '11 at 09:35
  • @xanatos That doesn't necessarily make it *faster*, which is what the question asks. – Tim Lloyd Oct 13 '11 at 09:46
  • @xanatos Relying on the worst kind of undefined behavior does not sound like a good idea to me. – CodesInChaos Oct 13 '11 at 09:48
  • @anth Why do you need the fastest way? This code will already run at many times real-time speed. – CodesInChaos Oct 13 '11 at 09:49
  • @CodesInChaos I'm not "suggesting" it. He asked for options. I gave him options. I have always believed in giving people many options and letting them choose the worst :-) – xanatos Oct 13 '11 at 09:53
  • @CodesInChaos I need it as fast as possible because my app is running on a mobile PC with a relatively weak CPU. But you are probably right, and there is no room for a large improvement in this area. Anyway, I need to run some benchmarks on the proposed solutions. Thanks. – anth Oct 13 '11 at 10:08
  • @anth Let us know which was faster on your target hardware :) – Tim Lloyd Oct 13 '11 at 10:12
  • If performance is that important I'd seriously consider reusing all your buffers, especially if they are large enough to land on the LOH (I think this happens at >85kB, but that's an implementation detail). – CodesInChaos Oct 13 '11 at 10:13
  • You should also think about endianness issues. Some of the code posted here assumes native endianness, other code assumes a fixed endianness. Which one is correct depends on the endianness of your input data. – CodesInChaos Oct 13 '11 at 10:20
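
A minimal sketch of the wrapper idea from the comments (the `ChannelView` type and its members are invented for illustration, not part of any answer): instead of copying into separate arrays, each channel is exposed as an indexed view over the interleaved `shorts` buffer, so no per-channel copy happens at all.

// Hypothetical zero-copy view of one channel inside the interleaved buffer.
struct ChannelView
{
    private readonly short[] _interleaved;
    private readonly int _offset; // 0 = first channel, 1 = second channel

    public ChannelView(short[] interleaved, int offset)
    {
        _interleaved = interleaved;
        _offset = offset;
    }

    public int Length
    {
        get { return _interleaved.Length / 2; }
    }

    public short this[int index]
    {
        get { return _interleaved[index * 2 + _offset]; }
    }
}

// Usage: only the initial BlockCopy remains; reads are redirected on the fly.
// var channel1 = new ChannelView(shorts, 0);
// var channel2 = new ChannelView(shorts, 1);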

3 Answers


You can leave out copying the buffer:

byte[] rawData = //from audio device
short[] channel1 = new short[rawData.Length / 4];
short[] channel2 = new short[rawData.Length / 4];
for (int i = 0, j = 0; i < rawData.Length; i+=4, ++j)
{
    channel1[j] = (short)(((ushort)rawData[i + 1]) << 8 | (ushort)rawData[i]);
    channel2[j] = (short)(((ushort)rawData[i + 3]) << 8 | (ushort)rawData[i + 2]);
}

To make the loop faster, you can have a look at the Task Parallel Library, especially Parallel.For:

[EDIT]

System.Threading.Tasks.Parallel.For( 0, shorts.Length/2, ( i ) =>
{
    channel1[i] = shorts[i*2];
    channel2[i] = shorts[i*2+1];
} );

[/EDIT]

Another way is loop unrolling (a sketch follows), but I think the TPL will speed this up as well.
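
For illustration, a minimal sketch of the unrolling idea (this sketch assumes `rawData.Length` is a multiple of 8; otherwise the tail must be handled separately):

// Process two frames (8 bytes, one sample per channel twice) per iteration.
for (int i = 0, j = 0; i < rawData.Length; i += 8, j += 2)
{
    channel1[j]     = (short)(rawData[i + 1] << 8 | rawData[i]);
    channel2[j]     = (short)(rawData[i + 3] << 8 | rawData[i + 2]);
    channel1[j + 1] = (short)(rawData[i + 5] << 8 | rawData[i + 4]);
    channel2[j + 1] = (short)(rawData[i + 7] << 8 | rawData[i + 6]);
}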

PVitt
  • Does that actually speed it up though? – Tim Lloyd Oct 13 '11 at 09:37
  • I don't know. I think you have to do some measurements... But it saves memory. – PVitt Oct 13 '11 at 09:38
  • The OP is asking for the "*fastest way*". – Tim Lloyd Oct 13 '11 at 09:39
  • @chibacity There's really only one way to know if it's fastest: actual measuring. – MPelletier Oct 13 '11 at 09:41
  • @MPelletier Obviously. I'm not the one that is offering it as an answer though. – Tim Lloyd Oct 13 '11 at 09:47
  • @chibacity You save two allocations and one block copy. At the least this saves you some CPU cycles, though not that many, as most of the work is done inside the loop. I hope this answer shows clearly enough that it is faster than his first attempt. – PVitt Oct 13 '11 at 09:50
  • @PVitt 2 allocations? And how do allocations and block copy save CPU cycles? – Tim Lloyd Oct 13 '11 at 09:53
  • Beg your pardon, but allocating and copying memory does not happen in zero time. – PVitt Oct 13 '11 at 09:56
  • I'm just questioning to what extent the CPU is involved. – Tim Lloyd Oct 13 '11 at 10:03
  • Are you sure that code is correct? I have a bad feeling regarding sign extension and shifting here. Shouldn't the inner casts be to `ushort`? – CodesInChaos Oct 13 '11 at 10:17
  • @CodeInChaos In my benchmark it gives the correct result (and I'm using big arrays of random bytes). To make it compile without warnings: `channel1d[j] = (short)(((ushort)rawData[i + 1]) << 8 | (ushort)rawData[i]); channel2d[j] = (short)(((ushort)rawData[i + 3]) << 8 | (ushort)rawData[i + 2]);` – xanatos Oct 13 '11 at 10:20
  • @PVitt Actually in x86, this is not much faster than the original solution - it's very close. It comes into its own in x64. – Tim Lloyd Oct 13 '11 at 11:06

You could use unsafe code to avoid array indexing overhead and bit shifts. But as PVitt said, on modern PCs you are better off using standard managed code, plus the TPL if your data is big enough.

short[] channel1 = new short[rawData.Length / 4];
short[] channel2 = new short[rawData.Length / 4];

fixed (byte* pRawDataStart = rawData)
fixed (short* pChannel1Start = channel1)
fixed (short* pChannel2Start = channel2)
{
    // Pointers declared in a fixed statement are read-only,
    // so advance local copies instead.
    byte* pRawData = pRawDataStart;
    short* pChannel1 = pChannel1Start;
    short* pChannel2 = pChannel2Start;
    byte* end = pRawData + rawData.Length;
    while (pRawData < end)
    {
        *(pChannel1++) = *((short*)pRawData);
        pRawData += sizeof(short);
        *(pChannel2++) = *((short*)pRawData);
        pRawData += sizeof(short);
    }
}

As with all optimization problems you need to time carefully. Pay special attention to your buffer allocations: `channel1` and `channel2` could be static (big) buffers that grow automatically, of which you only use the first n elements. You would then skip two big array allocations on each execution of this function and make the GC work less (always better when timing is important).
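
A minimal sketch of that reuse pattern (the static fields and the `EnsureCapacity` helper are assumptions for illustration, not from the original answer):

// Buffers allocated once and grown only when a larger input arrives.
static short[] channel1 = new short[0];
static short[] channel2 = new short[0];

static void EnsureCapacity(int samplesPerChannel)
{
    if (channel1.Length < samplesPerChannel)
    {
        channel1 = new short[samplesPerChannel];
        channel2 = new short[samplesPerChannel];
    }
    // Callers must read only the first samplesPerChannel elements afterwards.
}

// Per buffer received: EnsureCapacity(rawData.Length / 4); then run the copy loop.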

As noted by CodesInChaos, endianness could be important: if your data is not in native endianness you will need to swap bytes during the conversion. For example, to convert between big and little endian (treating single bytes as the atomic elements), the code could look like this:

short[] channel1 = new short[rawData.Length / 4];
short[] channel2 = new short[rawData.Length / 4];

fixed (byte* pRawDataStart = rawData)
fixed (short* pChannel1Start = channel1)
fixed (short* pChannel2Start = channel2)
{
    byte* pRawData = pRawDataStart;
    byte* pChannel1 = (byte*)pChannel1Start;   // low byte of the current channel 1 sample
    byte* pChannel1High = pChannel1 + 1;       // high byte of the current channel 1 sample
    byte* pChannel2 = (byte*)pChannel2Start;   // low byte of the current channel 2 sample
    byte* pChannel2High = pChannel2 + 1;       // high byte of the current channel 2 sample
    byte* end = pRawData + rawData.Length;

    while (pRawData < end)
    {
        // Write each input byte into the opposite half of the output sample,
        // swapping the byte order while de-interleaving.
        *pChannel1High = *pRawData++;
        *pChannel1 = *pRawData++;
        pChannel1 += sizeof(short);
        pChannel1High += sizeof(short);

        *pChannel2High = *pRawData++;
        *pChannel2 = *pRawData++;
        pChannel2 += sizeof(short);
        pChannel2High += sizeof(short);
    }
}
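
If the endianness of the incoming data is only known relative to the machine, one option (a sketch; the two `Deinterleave*` method names are hypothetical wrappers around the loops above) is to branch on `BitConverter.IsLittleEndian`:

// Choose the loop at run time: plain copy when the stream matches the
// machine's byte order, byte-swapping version otherwise (little-endian input assumed).
if (BitConverter.IsLittleEndian)
    DeinterleaveDirect(rawData, channel1, channel2);        // first loop above
else
    DeinterleaveByteSwapping(rawData, channel1, channel2);  // swapping loop above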

I didn't compile any of the code in this post with an actual compiler, so if you find errors feel free to edit it.

Julien Roncaglia
  • Even fairly modest arrays (85,000 bytes or more) will end up on the LOH (Large Object Heap) - this is cheap in terms of GC. – Tim Lloyd Oct 13 '11 at 09:48
  • @chiba Collecting stuff on the LOH is very expensive in my experience, since objects on the LOH only get collected during a Gen2 GC. So cutting down on large allocations is often a big performance gain. – CodesInChaos Oct 13 '11 at 10:12
  • @CodeInChaos Collecting is cheap on the LOH - no compacting and no generation management. Why would collection be expensive? – Tim Lloyd Oct 13 '11 at 10:19
  • As I said, it requires a Gen2 GC, and in non-trivial applications those are expensive since they need to crawl all managed objects, not just the new ones. Another issue is that since Gen2 collections are rarer than Gen0/1 collections, memory usage will rise a lot. – CodesInChaos Oct 13 '11 at 10:22
  • @CodeInChaos Maybe I'm misunderstanding something here. Although LOH objects report being in generation 2, they are not actually "in" the same management structures as objects in the SOH which are reported as generation 2. They get managed at the same time as a generation 2 sweep, but they are entirely separate and managed completely differently. – Tim Lloyd Oct 13 '11 at 10:25
  • Yes, but that implies that to collect this data you need a Gen2 sweep. And that is usually expensive. – CodesInChaos Oct 13 '11 at 10:27
  • @CodeInChaos How expensive it is is highly dependent on how expensive the Gen2 SOH collect is. If there's bugger all in Gen2 SOH, it's all gravy :) – Tim Lloyd Oct 13 '11 at 10:38
  • +1 This is by far the fastest solution so far in both x86 and x64. – Tim Lloyd Oct 13 '11 at 11:15

You can benchmark it yourself! Remember to use Release mode, enable unsafe code, and run without the debugger (Ctrl+F5):

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;

class Program
{
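    // Union-style overlay: Bytes and Shorts share the same field offset, so
    // assigning a byte[] lets it be read back as a short[] without copying.
    // The CLR is not designed for this; Length still reports the byte count.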
    [StructLayout(LayoutKind.Explicit)]
    struct UnionArray
    {
        [FieldOffset(0)]
        public byte[] Bytes;

        [FieldOffset(0)]
        public short[] Shorts;
    }

    unsafe static void Main(string[] args)
    {
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;

        byte[] rawData = new byte[10000000];
        new Random().NextBytes(rawData);

        Stopwatch sw1 = Stopwatch.StartNew();

        short[] shorts = new short[rawData.Length / 2];
        short[] channel1 = new short[rawData.Length / 4];
        short[] channel2 = new short[rawData.Length / 4];
        System.Buffer.BlockCopy(rawData, 0, shorts, 0, rawData.Length);
        for (int i = 0, j = 0; i < shorts.Length; i += 2, ++j)
        {
            channel1[j] = shorts[i];
            channel2[j] = shorts[i + 1];
        }

        sw1.Stop();

        Stopwatch sw2 = Stopwatch.StartNew();

        short[] channel1b = new short[rawData.Length / 4];
        short[] channel2b = new short[rawData.Length / 4];

        for (int i = 0, j = 0; i < rawData.Length; i += 4, ++j)
        {
            channel1b[j] = BitConverter.ToInt16(rawData, i);
            channel2b[j] = BitConverter.ToInt16(rawData, i + 2);
        }

        sw2.Stop();

        Stopwatch sw3 = Stopwatch.StartNew();

        short[] shortsc = new UnionArray { Bytes = rawData }.Shorts;
        short[] channel1c = new short[rawData.Length / 4];
        short[] channel2c = new short[rawData.Length / 4];
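
        // The union trick leaves shortsc.Length reporting the original *byte*
        // count, so the loop below intentionally bounds on shorts.Length.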

        for (int i = 0, j = 0; i < shorts.Length; i += 2, ++j)
        {
            channel1c[j] = shortsc[i];
            channel2c[j] = shortsc[i + 1];
        }

        sw3.Stop();

        Stopwatch sw4 = Stopwatch.StartNew();

        short[] channel1d = new short[rawData.Length / 4];
        short[] channel2d = new short[rawData.Length / 4];

        for (int i = 0, j = 0; i < rawData.Length; i += 4, ++j)
        {
            channel1d[j] = (short)((short)(rawData[i + 1]) << 8 | (short)rawData[i]);
            channel2d[j] = (short)((short)(rawData[i + 3]) << 8 | (short)rawData[i + 2]);
            //Equivalent warning-less version
            //channel1d[j] = (short)(((ushort)rawData[i + 1]) << 8 | (ushort)rawData[i]);
            //channel2d[j] = (short)(((ushort)rawData[i + 3]) << 8 | (ushort)rawData[i + 2]);

        }

        sw4.Stop();

        Stopwatch sw5 = Stopwatch.StartNew();

        short[] channel1e = new short[rawData.Length / 4];
        short[] channel2e = new short[rawData.Length / 4];

        fixed (byte* pRawData = rawData)
        fixed (short* pChannel1 = channel1e)
        fixed (short* pChannel2 = channel2e)
        {
            byte* pRawData2 = pRawData;
            short* pChannel1e = pChannel1;
            short* pChannel2e = pChannel2;

            byte* end = pRawData2 + rawData.Length;

            while (pRawData2 < end)
            {
                (*(pChannel1e++)) = *((short*)pRawData2);
                pRawData2 += sizeof(short);
                (*(pChannel2e++)) = *((short*)pRawData2);
                pRawData2 += sizeof(short);
            }
        }

        sw5.Stop();

        Stopwatch sw6 = Stopwatch.StartNew();

        short[] shortse = new short[rawData.Length / 2];
        short[] channel1f = new short[rawData.Length / 4];
        short[] channel2f = new short[rawData.Length / 4];
        System.Buffer.BlockCopy(rawData, 0, shortse, 0, rawData.Length);

        System.Threading.Tasks.Parallel.For(0, shortse.Length / 2, (i) =>
        {
            channel1f[i] = shortse[i * 2];
            channel2f[i] = shortse[i * 2 + 1];
        });

        sw6.Stop();


        if (!channel1.SequenceEqual(channel1b) || !channel1.SequenceEqual(channel1c) || !channel1.SequenceEqual(channel1d) || !channel1.SequenceEqual(channel1e) || !channel1.SequenceEqual(channel1f))
        {
            throw new Exception();
        }

        if (!channel2.SequenceEqual(channel2b) || !channel2.SequenceEqual(channel2c) || !channel2.SequenceEqual(channel2d) || !channel2.SequenceEqual(channel2e) || !channel2.SequenceEqual(channel2f))
        {
            throw new Exception();
        }

        Console.WriteLine("Original: {0}ms", sw1.ElapsedMilliseconds);
        Console.WriteLine("BitConverter: {0}ms", sw2.ElapsedMilliseconds);
        Console.WriteLine("Super-unsafe struct: {0}ms", sw3.ElapsedMilliseconds);
        Console.WriteLine("PVitt shifts: {0}ms", sw4.ElapsedMilliseconds);
        Console.WriteLine("unsafe VirtualBlackFox: {0}ms", sw5.ElapsedMilliseconds);
        Console.WriteLine("TPL: {0}ms", sw6.ElapsedMilliseconds);
        Console.ReadKey();
        return;
    }
}
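
As the comments note, one-shot timings are noisy. A sketch of a fairer harness (the `TimeBest` helper is an assumption, not part of the original answer) that repeats each test and forces a collection between runs, as suggested in the comments:

// Run a test several times, forcing a full GC between runs so one test's
// garbage does not distort the next; report the best time.
static long TimeBest(Action test, int runs)
{
    long best = long.MaxValue;
    for (int i = 0; i < runs; i++)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        Stopwatch sw = Stopwatch.StartNew();
        test();
        sw.Stop();
        if (sw.ElapsedMilliseconds < best) best = sw.ElapsedMilliseconds;
    }
    return best;
}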
xanatos
  • VirtualBlackFox, followed by PVitt is definitely faster on my machine once things have been warmed up. – Tim Lloyd Oct 13 '11 at 10:11
  • @chibacity Are you running your program in Release mode WITHOUT the debugger? (CTRL-F5). The differences are big. But yes, I would say that PVitt is the "best" "safe" solution. – xanatos Oct 13 '11 at 10:12
  • @xanatos Yes I'm running it properly :) Updated my comment with VirtualBlackFox being fastest after adding it to the test suite. – Tim Lloyd Oct 13 '11 at 10:15
  • @xanatos No, in your original version you put your version as fastest; it's actually 3rd *behind* PVitt :P – Tim Lloyd Oct 13 '11 at 10:27
  • @chibacity The unsafe struct trick is faster on my machine than PVitt. It could depend on array size. – xanatos Oct 13 '11 at 10:31
  • @xanatos How many consecutive runs are you doing - are you issuing a `GC.Collect();GC.WaitForPendingFinalizers();` between each test? Over 3 runs on my machine, yours is 3rd. – Tim Lloyd Oct 13 '11 at 10:36
  • @chibacity Tried in a `while (true)` with the GC at the end; same result. But the difference is VERY small (around 10%). So PVitt IS better. – xanatos Oct 13 '11 at 10:39
  • @xanatos I'm getting PVitt ~25% faster than yours (`Elapsed.TotalMilliseconds`). – Tim Lloyd Oct 13 '11 at 10:44
  • @chibacity Perhaps you are running 64 bits? I see a 25% difference when running the Debug code without the Debugger. But in the end, I wouldn't use the `struct` version. And it isn't "my" version :-) I don't want to be mixed up with that hack :-) :-) – xanatos Oct 13 '11 at 10:51
  • That's very interesting, the scratch project I was using was indeed x64. When I change to x86, the unsafe version goes from ~5ms up to ~7ms, and your version becomes ~25% faster than PVitt, but goes from ~8.5ms up to ~12.5ms. The difference in x86 and x64 is quite large. Using x86 PVitt is not that much better than the original solution - it's minimal. – Tim Lloyd Oct 13 '11 at 11:03
  • The unsafe solution is about 40% faster for me than the others. Thanks for the great feedback. – anth Oct 13 '11 at 11:40
  • +1 For actually measuring the solutions relative to each other. – Tim Lloyd Oct 13 '11 at 12:28