2

I have an array of interleaved signed 24-bit ints (complex numbers) in little endian order that I would like to convert to a complex array of floats or doubles. By interleaved, I mean:

R1 R2 R3 I1 I2 I3 R4 R5 R6 I4 I5 I6 . . .

Where each item is an 8-bit byte, and each three together are a 24-bit int, with R = real and I = imaginary.

What's the most efficient way to do this in C#? The code has to run many times, so I'm trying to squeeze every last cycle out of it I can. I'm hoping for something more efficient than a brute force shift, or and cast.

I wouldn't mind using unsafe code in this case, if it would help.

Here's the baseline, brute-force approach with the second number of the pair commented out, and with sign handling ignored for the moment, to simplify the IDL:

class Program
{
    const int Size = 10000000;
    static void Main(string[] args)
    {
        //
        // Array of little-endian 24-bit complex ints
        // (least significant byte first)
        //
        byte[] buf = new byte[3 * 2 * Size];
        float[] real = new float[Size];
        //float[] imag = new float[Size];

        //
        // The brute-force way
        //
        int j = 0;
        Stopwatch timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < Size; i++)
        {
            real[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
            j += 3;
            // imag[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
            j += 3;
        }
        timer.Stop();
        Console.WriteLine("result = " + 
            (float)(timer.ElapsedMilliseconds * 1000.0f / Size) + 
            " microseconds per complex number");
        Console.ReadLine();
    }
}

and the associated IDL:

  IL_0024:  ldc.i4.0
  IL_0025:  stloc.s    i
  IL_0027:  br.s       IL_0050
  IL_0029:  ldloc.1
  IL_002a:  ldloc.s    i
  IL_002c:  ldloc.0
  IL_002d:  ldloc.2
  IL_002e:  ldelem.u1
  IL_002f:  ldloc.0
  IL_0030:  ldloc.2
  IL_0031:  ldc.i4.1
  IL_0032:  add
  IL_0033:  ldelem.u1
  IL_0034:  ldc.i4.8
  IL_0035:  shl
  IL_0036:  or
  IL_0037:  ldloc.0
  IL_0038:  ldloc.2
  IL_0039:  ldc.i4.2
  IL_003a:  add
  IL_003b:  ldelem.u1
  IL_003c:  ldc.i4.s   16
  IL_003e:  shl
  IL_003f:  or
  IL_0040:  conv.r4
  IL_0041:  stelem.r4
  IL_0042:  ldloc.2
  IL_0043:  ldc.i4.3
  IL_0044:  add
  IL_0045:  stloc.2
  IL_0046:  ldloc.2
  IL_0047:  ldc.i4.3
  IL_0048:  add
  IL_0049:  stloc.2
  IL_004a:  ldloc.s    i
  IL_004c:  ldc.i4.1
  IL_004d:  add
  IL_004e:  stloc.s    i
  IL_0050:  ldloc.s    i
  IL_0052:  ldc.i4     0x989680
  IL_0057:  blt.s      IL_0029
RickNZ
  • 18,448
  • 3
  • 51
  • 66
  • 1
    As long as the bytes live in memory you'll be forced to perform the shifts anyway (whether you do it in your code or the framework works some magic). There are very few constructs in .NET (or even C++) to help you perform this any more quickly than doing the bit shifting manually. – M.Babcock Jul 18 '13 at 04:17
  • In C or C++, I could use pointers, and copy the three bytes into the lower 32-bits of a union of a float and an int or array of 4 bytes (leaving the mantissa zero) -- or maybe copy all 4 at once as an int, if the alignment is right. That would avoid the multiplies that come with array indexing, the shifts, and the cast. – RickNZ Jul 18 '13 at 04:52
  • 1
    The only random suggestion is to try unpack 12 bytes in iteration so you can just deal with int32 value all the way... – Alexei Levenkov Jul 18 '13 at 04:52
  • @RickNZ - I'm not sure all of those byte copies would be more efficient than a processor-optimized bit shift (IO is IO after all). If you're looking for a less code intensive solution - that will for sure work. Otherwise, I'd recommend some profiling. It almost sounds like premature optimization to me. – M.Babcock Jul 18 '13 at 04:56
  • @RickNZ - On second thought, depending the quantities you're talking about, the memory copies have the potential to become a detrimental bottleneck in your application (assuming this is a detrimental part of the system). If you're looking for throughput then you're absolutely better off reading the bytes as they lay rather than wasting conversion space and time. I'm still 99% sure this is premature optimization for a .NET app but I'm sure this will produce more scalable results. – M.Babcock Jul 18 '13 at 05:09
  • All I/O is not the same, sorry. I've written and profiled code like this many times, which is why I asked the question in the first place. – RickNZ Jul 18 '13 at 05:11
  • @RickNZ If the floats are intended to be IEEE 754 32-bit binary, you will **not** get the right answer by plugging a 24 bit int into three bytes of the float. The fastest way is probably going to be a cast, because that can use hardware conversion, rather than doing shifts that depend on the magnitude and bit masking in software. – Patricia Shanahan Jul 18 '13 at 05:14
  • @RickNZ - Please provide a simplified code sample exemplifying the performance issue so we can be of some assistance. Doing so will help shed some light on your issue. – M.Babcock Jul 18 '13 at 05:30
  • I'm not really expecting to avoid the cast. I'm mainly trying to think of a way to read and write the data as few times as possible and to avoid the multiplies that come with array indexing. Ideally, one read and write to unpack all 24-bits, one read and write to do the cast, and no multiplies. – RickNZ Jul 18 '13 at 05:33
  • C# uses JIT compiler optimization, so you cannot be sure that what will run does all the same steps as the code you posted. Array access inside a simple loop with no function calls is extremely easy to optimize, so I would be surprised if the code that actually runs in non-debug mode is unnecessarily slow. – Patricia Shanahan Jul 18 '13 at 11:51

2 Answers2

1

Late to the party but this looked like fun ;-)

A couple of experiments (using unsafe) below. Method1() is yours. On my laptop, with an AnyCPU Release build, there's a consistent 20%-odd improvement for Method2() and no significant additional benefit for Method3(). (Timed over 100_000_000 iterations.)

I was going for pointers without (explicit) shifting (masking unavoidable).

Some typical results...

result = 0.0075 microseconds per complex number
result = 0.00542 microseconds per complex number
result = 0.00516 microseconds per complex number

result = 0.00753 microseconds per complex number
result = 0.0052 microseconds per complex number
result = 0.00528 microseconds per complex number

Code...

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

namespace SO_20210326
{
  //  Enable unsafe code
  [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 6)]
  struct NumPair
  {
    [FieldOffset(0)] public int r;
    [FieldOffset(3)] public int i;
  }

  class Program
  {
    const int Size = 100000000;
    static void Method1()
    {
      //
      // Array of little-endian 24-bit complex ints
      // (least significant byte first)
      //
      byte[] buf = new byte[3 * 2 * Size];
      float[] real = new float[Size];
      float[] imag = new float[Size];

      //
      // The brute-force way
      //
      int j = 0;
      Stopwatch timer = new Stopwatch();
      timer.Start();
      for (int i = 0; i < Size; i++)
      {
        real[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
        j += 3;
        imag[i] = (float)(buf[j] | (buf[j + 1] << 8) | (buf[j + 2] << 16));
        j += 3;
      }
      timer.Stop();
      Console.WriteLine("result = " + 
                        (float)(timer.ElapsedMilliseconds * 1000.0f / Size) + 
                        " microseconds per complex number");
    }

    static void Method2()
    {
      NumPair[] buf = new NumPair[Size];
      float[] real = new float[Size];
      float[] imag = new float[Size];

      Stopwatch timer = new Stopwatch();
      timer.Start();
      for (int i = 0; i < Size; i++)
      {
        real[i] = buf[i].r & 0xffffff00;
        imag[i] = buf[i].i & 0xffffff00;

      }
      timer.Stop();
      Console.WriteLine("result = " + 
                        (float)(timer.ElapsedMilliseconds * 1000.0f / Size) + 
                        " microseconds per complex number");
    }

    static void Method3()
    {
      unsafe
      {
        NumPair[] buf = new NumPair[Size];
        float[] real = new float[Size];
        float[] imag = new float[Size];

        Stopwatch timer = new Stopwatch();
        timer.Start();
        fixed (void* pvalue = &buf[0])
        {
          var p = (byte*)pvalue;
          for (int i = 0; i < Size; i++)
          {
            real[i] = *(int*)p & 0xffffff00;
            p += 3;
            imag[i] = *(int*)p & 0xffffff00;
            p += 3;
          }
        }

        timer.Stop();
        Console.WriteLine("result = " + 
                          (float)(timer.ElapsedMilliseconds * 1000.0f / Size) + 
                          " microseconds per complex number");
      }
    }

    static void Main(string[] args)
    {
      Method1();
      Method2();
      Method3();
      Console.ReadLine();
    }
  }
}
AlanK
  • 1,827
  • 13
  • 16
  • 1
    this isn't the correct way to microbenchmark because there are many things that affect the result. You need to warm up first for the JITter to kick in, and make sure that cache effect is the same in all versions. And many other things also need to be taken care. It's very difficult do that. See https://stackoverflow.com/a/15366908/995714. See also https://stackoverflow.com/q/504103/995714 (Java but many points applies to C# as well, because there's no C# version) – phuclv Mar 26 '21 at 14:46
  • Agreed. Just came across the 7 year-old question and offered the techniques. – AlanK Mar 26 '21 at 20:32
0

Bit shifting should be very fast although you'll need to operate in a bigger unit, i.e. int or long, instead of byte. That avoids multiple shifts to combine the three bytes. But that also means you must do an unsafe cast from byte* buf to uint* buf or ulong* buf. Objects in .NET are aligned to 4 or 8 bytes by default, but in a few cases you might need to take care of the alignment issue yourself

uint* dataIn = (uint*)(ref buf);
uint i = 0;
float real[buf.Length/6], imag[buf.Length/6];

for (var data = dataIn; data < dataIn + buf.Length; data += 3)
{
    // Extract (R123, I123) and (R456, I456) at once
    // D1--------- D2--------- D3---------
    // R1 R2 R3 I1 I2 I3 R4 R5 R6 I4 I5 I6
    real[i] = data[0] >> 8;
    imag[i] = ((data[0] & 0xff) << 16) | (data[1] >> 16);
    real[i + 1] = ((data[1] & 0xffff) << 8) | (data[2] >> 24);
    imag[i + 1] = data[2] & 0xffffff;
    i++;
}

Doing that in ulong is also similar, but you'll extract 4 complex numbers in one iteration instead of 2 like above. If the size is not a multiple of 12 then the remaining bytes will be extracted at the end, outside the loop.

You can also run the conversion in multiple threads easily. In newer .NET framework SIMD may be another solution because .NET Core has added many SIMD intrinsics

phuclv
  • 37,963
  • 15
  • 156
  • 475