I have a pretty simple function that demultiplexes data acquired from a board. The data comes in frames, each frame made up of multiple signals, as a single 1-dimensional array, and I need to split it into jagged arrays, one per signal.
I'm working in C#, but I have a core native function in C++ that does the work for a single signal. Basically it is the following:
void Functions::Demux(short* pMux, short* pDemux, int nSignals, int signalIndex, int nValuesPerSignal)
{
    // gather every nSignals-th sample, starting at this signal's offset in the frame
    short* pMuxStart = pMux + signalIndex;
    for (int i = 0; i < nValuesPerSignal; i++)
        *pDemux++ = *(pMuxStart + i * nSignals);
}
and then I call it from C# via C++/CLI (using pin_ptr<short>, so no copies are made) inside a parallel for:
Parallel.For(0, nSignals, (int i) =>
{
Core.Functions.Demux(muxed, demuxed[i], nSignals, i, nFramesPerSignal);
});
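The C++/CLI bridge is essentially just a thin pinning wrapper. Simplified sketch below (the Core::Functions class name is taken from the C# call above, and the native Functions::Demux is assumed to be reachable as a static member in the global namespace; the real wrapper may differ in details):

// C++/CLI (/clr) source
namespace Core
{
    public ref class Functions
    {
    public:
        static void Demux(array<short>^ muxed, array<short>^ demuxed,
                          int nSignals, int signalIndex, int nValuesPerSignal)
        {
            // pin_ptr pins the managed arrays in place while native code runs,
            // so the native function writes directly into them - no copy
            pin_ptr<short> pMux = &muxed[0];
            pin_ptr<short> pDemux = &demuxed[0];
            ::Functions::Demux(pMux, pDemux, nSignals, signalIndex, nValuesPerSignal);
        }
    };
}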
The muxed data comes from 16k signals (16-bit resolution), each sampled at 20k samples/s, which gives a data rate of 16k * 20k * 2 = 640 MB/s. When running the code on a workstation with 2 Xeon E5-2620 v4 CPUs (16 cores in total @ 2.1 GHz), demuxing takes about 115% of the acquisition time (for 10 s of data it takes 11.5 s).
I need to get down to at least half that time. Does anybody know of a way, perhaps with AVX, or better, of a high-performance library for this? Or maybe there is a way to use the GPU? I would prefer not to upgrade the CPU hardware, because that would probably cost more.
Edit
Please consider that nSignals and nValuesPerSignal can change, and that the interleaved array must be split into nSignals separate arrays to be further handled in C#.
Edit: further tests
In the meantime, following the remark from Cody Gray, I tested a single-core version:
void _Functions::_Demux(short* pMux, short** pDemux, int nSignals, int nValuesPerSignal)
{
    // walk the muxed buffer sequentially, filling one signal array after another
    for (int i = 0; i < nSignals; i++)
    {
        for (int j = 0; j < nValuesPerSignal; j++)
            pDemux[i][j] = *pMux++;
    }
}
called from C++/CLI:
int nSignals = demuxValues->Length;
int nValuesPerSignal = demuxValues[0]->Length;
pin_ptr<short> pMux = &muxValues[0];
// pin every destination array so the native code can write into them directly
array<GCHandle>^ pins = gcnew array<GCHandle>(nSignals);
for (int i = 0; i < nSignals; i++)
    pins[i] = GCHandle::Alloc(demuxValues[i], GCHandleType::Pinned);
try
{
    array<short*>^ arrays = gcnew array<short*>(nSignals);
    for (int i = 0; i < nSignals; i++)
        arrays[i] = static_cast<short*>(pins[i].AddrOfPinnedObject().ToPointer());
    pin_ptr<short*> pDemux = &arrays[0];
    _Functions::_Demux(pMux, pDemux, nSignals, nValuesPerSignal);
}
finally
{
    for each (GCHandle pin in pins)
        pin.Free();
}
and I obtain a computing time of about 105% of the acquisition time, which is still too much but clearly shows that my Parallel.For was not the right choice. From your replies I guess the only viable solution is SSE/AVX. I have never written code for that; could some of you point me in the right direction? I think we can assume that the processor will always support AVX2.
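To give an idea of what a SIMD version could look like: the demux is essentially a big 16-bit matrix transpose, and the classic building block is an 8x8 transpose of shorts done with unpack instructions. Below is a minimal sketch (plain SSE2, which every AVX2 machine also supports; an AVX2 version would widen this to 16x16 blocks). It assumes nSignals and nValuesPerSignal are multiples of 8 and that the pDemux pointers come from pinned arrays as above; Transpose8x8_16 and DemuxSse are illustrative names, not existing library calls.

#include <emmintrin.h>   // SSE2 intrinsics

// Transpose one 8x8 block of 16-bit samples.
// in:       top-left of the block inside the muxed buffer
// inStride: distance between two frames, i.e. nSignals
// outRows:  8 destination signal arrays
// outCol:   frame index at which the 8 transposed samples are written
static void Transpose8x8_16(const short* in, int inStride, short* const* outRows, int outCol)
{
    __m128i r0 = _mm_loadu_si128((const __m128i*)(in + 0 * inStride));
    __m128i r1 = _mm_loadu_si128((const __m128i*)(in + 1 * inStride));
    __m128i r2 = _mm_loadu_si128((const __m128i*)(in + 2 * inStride));
    __m128i r3 = _mm_loadu_si128((const __m128i*)(in + 3 * inStride));
    __m128i r4 = _mm_loadu_si128((const __m128i*)(in + 4 * inStride));
    __m128i r5 = _mm_loadu_si128((const __m128i*)(in + 5 * inStride));
    __m128i r6 = _mm_loadu_si128((const __m128i*)(in + 6 * inStride));
    __m128i r7 = _mm_loadu_si128((const __m128i*)(in + 7 * inStride));

    // three rounds of interleaving: 16-bit, then 32-bit, then 64-bit lanes
    __m128i t0 = _mm_unpacklo_epi16(r0, r1), t1 = _mm_unpackhi_epi16(r0, r1);
    __m128i t2 = _mm_unpacklo_epi16(r2, r3), t3 = _mm_unpackhi_epi16(r2, r3);
    __m128i t4 = _mm_unpacklo_epi16(r4, r5), t5 = _mm_unpackhi_epi16(r4, r5);
    __m128i t6 = _mm_unpacklo_epi16(r6, r7), t7 = _mm_unpackhi_epi16(r6, r7);

    __m128i u0 = _mm_unpacklo_epi32(t0, t2), u1 = _mm_unpackhi_epi32(t0, t2);
    __m128i u2 = _mm_unpacklo_epi32(t1, t3), u3 = _mm_unpackhi_epi32(t1, t3);
    __m128i u4 = _mm_unpacklo_epi32(t4, t6), u5 = _mm_unpackhi_epi32(t4, t6);
    __m128i u6 = _mm_unpacklo_epi32(t5, t7), u7 = _mm_unpackhi_epi32(t5, t7);

    _mm_storeu_si128((__m128i*)(outRows[0] + outCol), _mm_unpacklo_epi64(u0, u4));
    _mm_storeu_si128((__m128i*)(outRows[1] + outCol), _mm_unpackhi_epi64(u0, u4));
    _mm_storeu_si128((__m128i*)(outRows[2] + outCol), _mm_unpacklo_epi64(u1, u5));
    _mm_storeu_si128((__m128i*)(outRows[3] + outCol), _mm_unpackhi_epi64(u1, u5));
    _mm_storeu_si128((__m128i*)(outRows[4] + outCol), _mm_unpacklo_epi64(u2, u6));
    _mm_storeu_si128((__m128i*)(outRows[5] + outCol), _mm_unpackhi_epi64(u2, u6));
    _mm_storeu_si128((__m128i*)(outRows[6] + outCol), _mm_unpacklo_epi64(u3, u7));
    _mm_storeu_si128((__m128i*)(outRows[7] + outCol), _mm_unpackhi_epi64(u3, u7));
}

// Demux nValuesPerSignal frames of nSignals 16-bit samples in 8x8 blocks.
// Assumes both dimensions are multiples of 8; a scalar loop must handle any remainder.
void DemuxSse(const short* pMux, short* const* pDemux, int nSignals, int nValuesPerSignal)
{
    for (int frame = 0; frame < nValuesPerSignal; frame += 8)
    {
        const short* pBlockRow = pMux + (size_t)frame * nSignals;
        for (int sig = 0; sig < nSignals; sig += 8)
            Transpose8x8_16(pBlockRow + sig, nSignals, &pDemux[sig], frame);
    }
}

Note that at these data sizes the job is largely limited by memory bandwidth, so the traversal order (blocking over frames and signals so reads and writes stay cache-friendly) matters at least as much as the instruction set; the block loops above would likely need further tiling and could be split across Parallel.For tasks by frame range.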
Edit: my initial code vs. Matt Timmermans' solution
On my machine I compared my initial code (where I was simply using Parallel.For and calling a native function to deinterlace a single signal) with the code proposed by Matt Timmermans (still using Parallel.For, but in a cleverer way). See the results (in ms) against the number of tasks used in the Parallel.For (I have 32 threads):
N.Tasks   MyCode   MattCode
      4     1649        841
      8      997        740
     16      884        497
     32      810        290
So performance is much improved. However, I will still run some tests on the AVX idea.
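Independently of the exact details of Matt's code, the general idea of coarser tasks can be sketched as follows: instead of one task per signal (each of which strides across the whole muxed buffer), each task demuxes a contiguous range of frames into all signals, so the muxed buffer is streamed through sequentially exactly once in total. This is only my own illustration of that idea, not Matt Timmermans' actual code:

// Illustration only: one coarse work item that demuxes the frame range
// [frameBegin, frameEnd) into all signal arrays.
void DemuxFrameRange(const short* pMux, short* const* pDemux, int nSignals,
                     int frameBegin, int frameEnd)
{
    for (int frame = frameBegin; frame < frameEnd; frame++)
    {
        const short* pFrame = pMux + (size_t)frame * nSignals;  // one interleaved frame
        for (int sig = 0; sig < nSignals; sig++)
            pDemux[sig][frame] = pFrame[sig];                    // one sample per signal
    }
}

Each of the N tasks in the Parallel.For would then get roughly nValuesPerSignal / N consecutive frames.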