8

I have a pretty simple function that demultiplexes data acquired from a board. Data comes in frames, each frame made up of multiple signals, as a 1-dimensional array, and I need to convert it into jagged arrays, one per signal. Basically the following: [demuxing diagram]

I'm working in C#, but I have a core function in C that does the work for a single signal:

void Functions::Demux(short* pMux, short* pDemux, int nSignals, int signalIndex, int nValuesPerSignal)
{
    short* pMuxStart = pMux + signalIndex;
    for (int i = 0; i < nValuesPerSignal; i++)
        *pDemux++ = *(pMuxStart + i * nSignals);
}

and then I call it from C# via C++/CLI (using pin_ptr&lt;short&gt;, so no copy) with a parallel for:

Parallel.For(0, nSignals, (int i) =>
{
    Core.Functions.Demux(muxed, demuxed[i], nSignals, i, nFramesPerSignal);
});

The muxed data comes from 16k signals (16-bit resolution); each signal has 20k samples/s, which gives a data rate of 16k * 20k * 2 = 640 MB/s. When running the code on a workstation with 2 Xeon E5-2620 v4 CPUs (16 cores in total @ 2.1 GHz), demuxing takes about 115% of real time (for 10 s of data it takes 11.5 s).

I need to get down to at least half that time. Does anybody know of a way, perhaps with AVX technology, or better, of a high-performance library for this? Or maybe there is a way using the GPU? I would prefer not to upgrade the CPU hardware, because that would probably cost more.

Edit: please consider that nSignals and nValuesPerSignal can change, and that the interleaved array must be split into nSignals separate arrays to be handled further in C#.

Edit: further tests

In the meantime, following the remark from Cody Gray, I tested with a single core by:

void _Functions::_Demux(short* pMux, short** pDemux, int nSignals, int nValuesPerSignal)
{
    for (int i = 0; i < nSignals; i++)
    {
        for (int j = 0; j < nValuesPerSignal; j++)
            pDemux[i][j] = *pMux++;
    }
}

called from C++/CLI:

int nSignals = demuxValues->Length;
int nValuesPerSignal = demuxValues[0]->Length;

pin_ptr<short> pMux = &muxValues[0];

array<GCHandle>^ pins = gcnew array<GCHandle>(nSignals);
for (int i = 0; i < nSignals; i++)
    pins[i] = GCHandle::Alloc(demuxValues[i], GCHandleType::Pinned);

try
{
    array<short*>^ arrays = gcnew array<short*>(nSignals);
    for (int i = 0; i < nSignals; i++)
        arrays[i] = static_cast<short*>(pins[i].AddrOfPinnedObject().ToPointer());

    pin_ptr<short*> pDemux = &arrays[0];

    _Functions::_Demux(pMux, pDemux, nSignals, nValuesPerSignal);
}
finally
{
    for each (GCHandle pin in pins) pin.Free();
}

and I obtain a computing time of about 105% of real time, which is still too much, but it clearly shows that Parallel.For is not the right choice. From your replies I guess the only viable solution is SSE/AVX. I never wrote code for that; could someone point me in the right direction? I think we can assume the processor will always support AVX2.

Edit: my initial code vs Matt Timmermans's solution

On my machine I compared my initial code (where I was simply using Parallel.For and calling a C function responsible for deinterleaving a single signal) with the code proposed by Matt Timmermans (still using Parallel.For, but in a cleverer way). See the results (in ms) against the number of tasks used in the Parallel.For (I have 32 threads):

N. Tasks   MyCode   MattCode
4          1649     841
8          997      740
16         884      497
32         810      290

So performance is much improved. However, I'll still do some tests on the AVX idea.

Mauro Ganswer
  • You are very likely shooting yourself in the foot here using `Parallel.For`. The overhead of additional threads is far too great for the cost of this simple operation. The C++ code should be processing multiple signals at a time in parallel, and you should be letting the far more powerful C++ optimizer automatically vectorize the code. The C# JIT compiler won't do that. If it's still too slow, then you could consider manually using SSE2 instructions via intrinsics. – Cody Gray - on strike Dec 14 '16 at 09:15
  • Hand-writing an SSE-based implementation will be more difficult if you have to deal with arbitrary numbers of values per signal. In your diagram, there are always 3. Is that an assumption that can be made? Or do you really need to parameterize on this? – Cody Gray - on strike Dec 14 '16 at 09:31
  • SIMD would be quite good for this, but you might need to code a separate routine for each possible value of nSignals. – Paul R Dec 14 '16 at 09:32
  • If you can't even cope with deinterleaving, how can you process this data then? – Dec 14 '16 at 11:32
  • @YvesDaoust: as Cody said, I'm probably shooting myself with Parallel.For because according to my tests, the following things I need to do with the deinterleaved data (few algo and visualization of only a few of signals, not all) seem to work with around half (very rough estimation) of my resources. The bottleneck at the moment seems the deinterleaving step. – Mauro Ganswer Dec 14 '16 at 12:05
  • Many ways to get this wrong. On the top of that list is not *measuring*, so you have no idea what the bottleneck might be. And assuming that parallelism can solve this problem when it is actually the data bus that's the limitation; you have only one of those, and it is easily saturated by multiple processor cores. Use the proper tooling, spend the money on it, you'll very quickly earn it back. – Hans Passant Dec 14 '16 at 12:12
  • @HansPassant I was not thinking of the data bus indeed. OK, let's say I change my hardware; can you please tell me what I should look for to select the proper tooling? – Mauro Ganswer Dec 14 '16 at 12:20
  • @MauroGanswer: can't you consider not deinterleaving, or just deinterleaving the relevant parts, then? – Dec 14 '16 at 13:02
  • @YvesDaoust yes, it is an option, but I would have to change a lot of code and data-processing logic, and I would really like to avoid that. In that case I would prefer improving the hardware. – Mauro Ganswer Dec 14 '16 at 13:22
  • SSE or AVX should be a viable option, but you will have to roll up your sleeves (loading 48 bytes at a time into 3 registers and shuffling into 3 others). If your datasets are big, cache misses may become a limiting factor. – Dec 14 '16 at 13:35
  • This looks similar to deinterlacing of RGB. [This answer](http://stackoverflow.com/a/15377386/3962537) has a handy presentation on doing that with SIMD; it might be useful. – Dan Mašek Dec 15 '16 at 13:27
  • [Xeon E5-2620](http://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI) is a Sandybridge 6 core CPU (with hyperthreading). Since you say 2.1GHz, 16 cores, and that you can assume AVX2, maybe you mean a [Xeon E5-2620 v4](https://ark.intel.com/products/92986/Intel-Xeon-Processor-E5-2620-v4-20M-Cache-2_10-GHz)? Broadwell with 8 cores per socket. This is a very important difference! Sandybridge doesn't have AVX2, so lane-crossing shuffle options are very limited. – Peter Cordes Dec 16 '16 at 05:01
  • What C compiler? You did enable optimizations for the final test, right? (The one that loops over the input once, scattering to the demux arrays). BTW, looping that way should be a major improvement in cache-friendliness. All the loads and stores happen in sequential streams. Having multiple output streams is fine, especially if they're in separate 4k pages so the prefetcher can keep track of them all. – Peter Cordes Dec 16 '16 at 05:24
  • How often does the function run? i.e. what array size? Your L2 cache size is 256kiB per core, and your L3 cache size is 20MB per socket, shared among cores. Ideally your demuxed data is still hot in L2 when processed by the same thread that demuxed it, so the stores and reloads of demuxed data don't have to compete for shared L3/DRAM bandwidth. With many demux streams, maybe one thread should demux 2 or 4 streams, and another thread making a pass over the same data can demux another 2 or 4 streams, so you aren't splitting L2 of one core among 12 streams or something. Hrm. – Peter Cordes Dec 16 '16 at 05:32
  • @PeterCordes yes, it is the Xeon E5-2620 v4, and I have 2 processors, hence 16 cores in total (I modified the text to make it clear). I use Visual Studio. The array size of the muxed data is (at the moment) variable length, because it comes from an intermediate buffer and the consumer (the one that has to convert muxed to demuxed) takes all data available. However, this could be changed to a fixed data size. – Mauro Ganswer Dec 16 '16 at 11:37

2 Answers

5

As I mentioned in a comment, you are very likely shooting yourself in the foot here by using Parallel.For. The overhead of multiple threads is far too great for the cost of this simple operation. If you need raw speed so much that you're dropping down to implement this in C++, you shouldn't be using C# at all for anything performance-critical.

Rather, you should be letting the C++ code process multiple signals at a time, in parallel. A good C++ compiler has a far more powerful optimizer than the C# JIT compiler, so it should be able to automatically vectorize the code, allowing you to write something readable yet fast. Compiler switches allow you to easily indicate which instruction sets are available on your target machine(s): SSE2, SSSE3, AVX, AVX2, etc. The compiler will automatically emit the appropriate instructions.

If this is still not fast enough, you could consider manually writing the code using intrinsics to cause the desired SIMD instructions to be emitted. It is unclear from your question how variable the input is. Is the number of frames constant? What about the number of values per signal?

Assuming that your input looks exactly like the diagram, you could write the following implementation in C++, taking advantage of the PSHUFB instruction (supported by SSSE3 and later):

static const __m128i mask = _mm_setr_epi8(0, 1,   6,  7,  12, 13,
                                          2, 3,   8,  9,  14, 15,
                                          4, 5,  10, 11);

void Demux(short* pMuxed, short* pDemuxed, size_t count)
{
   for (size_t i = 0; i < count / 8; ++i)
   {
      _mm_store_si128((__m128i*)pDemuxed,
                      _mm_shuffle_epi8(_mm_load_si128((const __m128i*)pMuxed),
                                       mask));
      pMuxed   += 8;
      pDemuxed += 8;
   }
}

In a 128-bit SSE register, we can pack 8 different 16-bit short values. Therefore, inside of the loop, this code loads the next 8 shorts from the input array, re-shuffles them so that they're in the desired order, and then stores the resulting sequence back into the output array. It has to loop enough times to do this for all sets of 8 shorts in the input array, so we do it count / 8 times.

The resulting assembly code is something like the following:

    mov     edx,  DWORD PTR [esp+12]       ; load parameters into registers (count)
    mov     ecx,  DWORD PTR [esp+8]        ; (pMuxed)
    mov     eax,  DWORD PTR [esp+4]        ; (pDemuxed)
    movdqa  xmm0, XMMWORD PTR [mask]       ; load 'mask' into XMM register
    shr     edx,  3                        ; count / 8 = number of 8-short blocks
    sub     ecx,  eax

 Demux:
    movdqa  xmm1, XMMWORD PTR [ecx+eax]    ; load next 8 shorts from input array
    pshufb  xmm1, xmm0                     ; re-shuffle them
    movdqa  XMMWORD PTR [eax], xmm1        ; store these 8 shorts in output array

    add     eax, 16                        ; increment pointer
    dec     edx                            ; decrement counter...
    jne     Demux                          ;  and keep looping if necessary

(I wrote this code assuming that the input and output arrays are both aligned on 16-byte boundaries, which allows aligned loads and stores to be used. On older processors, this will be faster than unaligned loads; on newer generations of processors, the penalty for unaligned loads is virtually non-existent. This is easy to ensure and enforce in C/C++, but I'm not sure how you are allocating the memory for these arrays in the C# caller. If you control allocation, then you should be able to control alignment. If not, or you're only targeting late generations of processors that do not penalize unaligned loads, you may alter the code to do unaligned loads instead. Use the _mm_storeu_si128 and _mm_loadu_si128 intrinsics, which will cause MOVDQU instructions to be emitted, instead of MOVDQA.)

There are only 3 SIMD instructions inside of the loop, and the required loop overhead is minimal. This should be relatively fast, although there are almost certainly ways to make it even faster.

One significant optimization would be to avoid repeatedly loading and storing the data. In particular, to avoid storing the output into an output array. Depending on what you're going to do with the demuxed output, it would be more efficient to just leave it in an SSE register and work with it there. However, this won't interop well (if at all) with managed code, so you are very constrained if you must pass the results back to a C# caller.

To write truly efficient SIMD code, you want to have a high computation to load/store ratio. In other words, you want to do a lot of manipulation on the data in between the loads and stores. Here, you are only doing one operation (the shuffle) on the data in between the load and the store. Unfortunately, there is no way around that, unless you can interleave the subsequent "processing" code. Demuxing requires only one operation, meaning that your bottleneck will inevitably be the time it takes to read the input and write the output.

Another possible optimization would be manually unrolling the loop. But this introduces a number of potential complications and requires you to know something about the nature of your input. If the input arrays are generally short, unrolling doesn't make sense. If they are sometimes short and sometimes long, unrolling still may not make sense, because you'll have to explicitly deal with the case where the input array is short, breaking out of the loop early. If the input arrays are always rather long, then unrolling may be a performance win. Not necessarily, though; as mentioned above, the loop overhead here is pretty minimal.

If you need to parameterize based on the number of frames and number of values per signal, you will very likely have to write multiple routines. Or, at the very least, a variety of different masks. This will significantly increase the complexity of the code, and thus the maintenance cost (and potentially also the performance, since the instructions and data you need are less likely to be in the cache), so unless you can really do something that is significantly more optimal than a C++ compiler, you should consider letting the compiler generate the code.

Cody Gray - on strike
  • I doubt that this answers the question. The OP must be asking about deinterleaving a contiguous stream into three separate contiguous streams. – Dec 14 '16 at 11:42
  • I'm not familiar with SSE/AVX instructions, and hence I'll need to dig more in your proposal. As @YvesDaoust said a contiguous stream must be split in nSignals separate streams that must return back to C# (see also my edit). This unfortunately is required. That said, do you think that your proposal is still feasible? – Mauro Ganswer Dec 14 '16 at 12:02
  • Well, yeah, you can modify the code to do that. Pass in pointers to the output arrays (probably a pointer to pointers), process a chunk at a time, and write it into the correct array. I don't see why it would matter that it is a contiguous stream. The code you have currently is going to stop receiving input at some arbitrary point, process it, and then return. That's no different than how you'd do it here. The more complication you add, though, the less efficient the code becomes. If I had a better test case, I could benchmark it. – Cody Gray - on strike Dec 14 '16 at 12:05
  • @CodyGray ok I'll try to dig more into your proposal. As a test case, you could use nSignals=16k, nValuesPerSignal = 20k (1 sec of data), so an array of 320M short values, which should be split into 16k separate arrays each of 20k short values. – Mauro Ganswer Dec 14 '16 at 12:13
  • @CodyGray: I said contiguous, not continuous. –  Dec 14 '16 at 13:03
  • Oh, sorry @Yves. Yes, you certainly did. I cannot see why a contiguous stream would be a problem. That's what I was assuming here, and it actually makes the implementation much simpler. – Cody Gray - on strike Dec 14 '16 at 13:04
  • @CodyGray: I guess that you are missing the point of the buffer organization. A whole period requires three contiguous reads for three non-contiguous writes, and a much more complex shuffling than what you showed. – Dec 14 '16 at 13:06
  • I'd go for making the number of separate output streams a template parameter, so for each implementation it can be a compile-time constant. Unless it turns out that you can get efficient code for a general purpose routine. – Peter Cordes Dec 16 '16 at 05:14
3

Good news! You don't need multiple cores, SIMD, or fancy packages to solve this problem. You probably don't even need to call out to C.

Your bottleneck is memory bandwidth, because you're using it inefficiently.

With that CPU, your memory is probably fast enough to demux > 3 GB/s using one core, but each time you need a sample from RAM, the CPU fetches 64 bytes to fill a cache line, and you use only 2 of them. Those 64 bytes will hang around in cache for a while, and some other threads will probably use some of them, but the access pattern is still really bad.

All you really have to do is make good use of those 64 bytes. There are many ways. For example:

1) Try a simple loop in C#. Run through your input buffer from start to end, putting each sample you encounter where it belongs. This will use all 64 bytes every time you fill a cache line in reading, and your 16K output channels are few enough that the blocks you're writing to will mostly stay cached. This will probably be fast enough.

2) Continue calling out to your C function, but process the input buffer in 2MB chunks, and don't bother with multiple cores. Each of those 2MB chunks is small enough to stay cached until you're done with it. This will probably be a bit faster than (1).

3) If the above aren't fast enough (it may be close), then you can go multithreaded. Use method (2), but do a parallel For over the chunks. That way each core can do a whole 2MB chunk, making good use of its cache, and they won't contend with each other. Use at most 4 threads, or you may start stressing your cache again. If you really need to use more than 4 threads, then divide the work more finely, into groups of 1024 channels within each 2MB block or so... but you won't need to do that.

EDIT:

Oh, sorry -- option (1) is pretty difficult to implement in unsafe C#, because each fixed statement fixes only a few pointers, and using managed arrays is too slow. Option (2) is easy in unsafe C#, though, and still works fine. I wrote a test:

public static unsafe void doit(short[] inarray, short[][] demux, int nSignals, int nSamples)
{
    fixed (short* pin = inarray)
    {
        // Walk the input in blocks of 64 frames; with 16k signals that is
        // 64 * 16384 * 2 bytes = 2 MiB of interleaved data per block, small
        // enough to stay cached while every signal is scattered out of it.
        for (int block = 0; block < nSamples; block += 64)
        {
            for (int sig = 0; sig < nSignals; ++sig)
            {
                fixed (short* psig = demux[sig])
                {
                    short* s = pin + block * nSignals + sig; // this signal's first sample in the block
                    short* d = psig + block;                 // where it lands in the output array
                    short* e = d + Math.Min(64, nSamples - block);
                    while (d < e)
                    {
                        *d++ = *s;
                        s += nSignals;                       // stride over one frame
                    }
                }
            }
        }
    }
}
public static void Main()
{
    int nSignals = 16384;
    int nSamples = 20000;

    short[][] demux = new short[nSignals][];
    for (int i = 0; i < demux.Length; ++i)
    {
        demux[i] = new short[nSamples];
    }
    short[] mux = new short[nSignals * nSamples];
    //warm up
    doit(mux, demux, nSignals, nSamples);
    doit(mux, demux, nSignals, nSamples);
    doit(mux, demux, nSignals, nSamples);
    //time it
    var watch = System.Diagnostics.Stopwatch.StartNew();
    doit(mux, demux, nSignals, nSamples);
    watch.Stop();
    Console.WriteLine("Time (ms): " + watch.ElapsedMilliseconds);
}

That's one second of data, and on my box it says:

Time (ms): 724

Hmmm... that's better than real time, but not twice as fast as real time. Your box looks to be a little faster than mine, so maybe it's OK. Let's try the parallel version (3). The Main function is the same:

public static unsafe void dopart(short[] inarray, short[][] demux, int offset, int nSignals, int nSamples)
{
    fixed (short* pin = inarray)
    {
        for (int block = 0; block < nSamples; block += 64)
        {
            for (int sig = 0; sig < nSignals; ++sig)
            {
                fixed (short* psig = demux[sig])
                {
                    short* s = pin + (offset + block) * nSignals + sig;
                    short* d = psig + offset + block;
                    short* e = d + Math.Min(64, nSamples - block);
                    while (d < e)
                    {
                        *d++ = *s;
                        s += nSignals;
                    }
                }
            }
        }
    }
}
public static unsafe void doit(short[] inarray, short[][] demux, int nSignals, int nSamples)
{
    int steps = (nSamples + 1023) / 1024;
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = 4;
    Parallel.For(0, steps, options, step => {
        int offset = (int)(step * 1024);
        int size = Math.Min(1024, nSamples - offset);
        dopart(inarray, demux, offset, nSignals, size);
    });
}

That's better:

Time (ms): 283

The amount of data being read and written in this case is ~4.6 GB/s, which is a bit short of my theoretical maximum of 6.4 GB/s, and I only have 4 real cores, so I might be able to shave a bit more off by calling out to C, but there's not a lot of room for improvement.

Matt Timmermans
  • Hi Matt, are you sure of that? Did you test it? In my "Edit:further tests" I'm using a single core and I pass to the C function an entire chunk of 1sec of data (16k signals * 20k) that I run through from start to end (similarly to what you mention in (1) I guess), and still I'm very slow... – Mauro Ganswer Dec 14 '16 at 16:21
  • @Mauro, the C function you wrote in edit:further tests doesn't do what you want -- it mixes up all the signals in each output buffer. I guessed that you actually tested a different implementation that works. You also didn't put in the timing code so I can't tell what you measured (16000 GCHandle::Alloc calls might take a while). Anyway, yes I'm sure that you should be able to do this no problem, but of course I'm not sure that you're not making another mistake I can't see – Matt Timmermans Dec 14 '16 at 20:43
  • A simple function in C# doing (maybe) what you are saying (see below) takes 1000% of the time (10 times more). Can you please post an example? `int signalIdx = 0, frameIdx = 0; for (int i = 0; i < muxed.Length; i++) { demuxed[signalIdx][frameIdx] = muxed[i]; if (++signalIdx == nSignals) { signalIdx = 0; frameIdx++; } }` – Mauro Ganswer Dec 15 '16 at 16:09
  • Thanks @Matt, in my case I obtain not-so-good performance with 4 tasks (I have 2 CPUs, 16 cores and 32 threads in total, but each core runs at "only" 2.1 GHz); however, scaling up the number of tasks I obtain good results. See my edit. I voted up, but I do not yet mark it as the solution because I'm doing further tests. – Mauro Ganswer Dec 18 '16 at 13:07
  • @MauroGanswer I'm still surprised at your per-core performance. Are you sure you're running the release (optimized) build? – Matt Timmermans Dec 21 '16 at 16:40