
I'm trying to do FFT -> signal manipulation -> inverse FFT using Project NE10 in my C++ project. After the FFT I convert the complex output to amplitudes and phases, and before the IFFT I convert them back to complex values. According to my benchmarks, my C++ code does not perform as well as the SIMD-enabled NE10 code. Since I have no experience with ARM assembly, I'm looking for help writing NEON code for the unoptimised C module. For example, before the IFFT I do this:

for (int bin = 0; bin < NUM_FREQUENCY_BINS; bin++) {
    input[bin].real = amplitudes[bin] * cosf(phases[bin]);
    input[bin].imag = amplitudes[bin] * sinf(phases[bin]);
}

where `input` is an array of C structs (for complex values), and `amplitudes` and `phases` are float arrays.

The above block (O(n) complexity) takes about 0.6 ms for 8192 bins, while the NE10 FFT (O(n log n) complexity) takes only 0.1 ms because of SIMD operations. From what I've read so far on StackOverflow and elsewhere, intrinsics are not worth the effort, so I'm trying to do this in ARM NEON assembly only.

Shaucer
  • Intrinsics can *absolutely* be worth the effort. In particular, you get to leave the tedious work like register allocation to the compiler, which will *improve* performance compared to handwritten assembly unless you are *very* good at writing assembly and know the details of the microarchitecture the code will run on. – EOF Feb 17 '17 at 18:46
  • This [godbolt example](https://godbolt.org/g/te0BzE) shows calls to `cosf` and `sinf` in the loop. Function calls add a lot of overhead. See [this question on fast sine/cosine for ARMv7+NEON](http://stackoverflow.com/questions/1854254/fast-sine-cosine-for-armv7neon-looking-for-testers) and [neon_mathfun](http://gruntthepeon.free.fr/ssemath/neon_mathfun.html). `cosf` and `sinf` are related, so calculate both at once. The standard C library is very stringent about precision (or at least that is its emphasis). [sin(x)^2+cos(x)^2=1](https://en.wikibooks.org/wiki/Trigonometry/Sine_Squared_plus_Cosine_Squared). – artless noise Feb 19 '17 at 14:52
  • ... both at once, as above. That said, your 'signal manipulation' could be performed in either polar or real/imaginary form, so you may not need this conversion at all. It is at least worth investigating whether some math lets you keep everything in the same coordinates. 'Convolution' might be something to add to your tool kit. The signal manipulation aspect of your problem is probably the important part; i.e., why do you need the conversion? – artless noise Feb 19 '17 at 14:55
  • I'm changing an input signal from the microphone by shifting the amplitudes to higher frequencies. For me, the easiest way is to work with amplitudes. – Shaucer Feb 20 '17 at 09:56

2 Answers


You can use NEON for trig functions if you settle for approximations. I am not affiliated, but there is an implementation here that uses intrinsics to create vectorised sin/cos functions that are accurate to many decimal places and perform substantially better than simply calling `sinf`, etc. (benchmarks are provided by the author).

The code is especially well suited to your polar to cartesian calculation, as it generates sin and cos results simultaneously. It might not be suitable for something where absolute precision is crucial, but for anything to do with frequency domain audio processing, this normally is not the case.
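To illustrate why computing both results together is cheap, here is a minimal scalar sketch of the general idea: one shared range reduction, then two short polynomials. It is not the linked implementation. The name `sinCosApprox` is hypothetical, the coefficients are plain Taylor terms rather than tuned ones, and it assumes phases within a few multiples of pi because the range reduction is done in single precision.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from NE10 or the linked library):
// approximates sin(x) and cos(x) together from one range reduction.
inline void sinCosApprox(float x, float& s, float& c) {
    assert(std::isfinite(x));
    const float kHalfPi = 1.57079632679489661923f;
    const float kTwoOverPi = 0.63661977236758134308f;

    // Shared step: x = q * (pi/2) + r, with r roughly in [-pi/4, pi/4].
    const long q = std::lround(x * kTwoOverPi);
    const float r = x - static_cast<float>(q) * kHalfPi;
    const float r2 = r * r;

    // Taylor polynomials in Horner form (assumed coefficients, not tuned).
    const float sr = r * (1.0f + r2 * (-1.0f / 6 + r2 * (1.0f / 120
                     + r2 * (-1.0f / 5040))));
    const float cr = 1.0f + r2 * (-0.5f + r2 * (1.0f / 24
                     + r2 * (-1.0f / 720 + r2 * (1.0f / 40320))));

    // Map the reduced results back to the original quadrant.
    switch (q & 3) {
        case 0:  s = sr;  c = cr;  break;
        case 1:  s = cr;  c = -sr; break;
        case 2:  s = -sr; c = -cr; break;
        default: s = -cr; c = sr;  break;
    }
}
```

In the question's loop this would replace the separate `cosf` and `sinf` calls with one call per bin. A vectorised version would apply the same steps to several bins per instruction.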

scary_jeff

As far as I know, NEON has no vector instructions for trigonometric functions (sin, cos). But of course you can still improve your code. One option is to use a table of precalculated sine and cosine values. This can improve performance significantly.
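A minimal sketch of the table approach, under stated assumptions: the table size of 4096 entries, the nearest-entry lookup, and the names `Complex`, `sineTable` and `polarToCartesianLut` are all illustrative choices, not part of NE10. With these choices the phase is quantised to steps of 2*pi/4096, so the sine and cosine values are off by at most about 8e-4.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's complex struct.
struct Complex { float real; float imag; };

constexpr std::size_t kTableSize = 4096;  // assumed size; must be a power of two
constexpr float kTwoPi = 6.28318530717958647692f;

// One full period of sine, built once on first use.
static const std::vector<float>& sineTable() {
    static const std::vector<float> table = [] {
        std::vector<float> t(kTableSize);
        for (std::size_t i = 0; i < kTableSize; ++i)
            t[i] = std::sin(kTwoPi * static_cast<float>(i) / kTableSize);
        return t;
    }();
    return table;
}

void polarToCartesianLut(const float* amplitudes, const float* phases,
                         Complex* out, std::size_t n) {
    assert(amplitudes && phases && out);
    const std::vector<float>& table = sineTable();
    const float scale = kTableSize / kTwoPi;
    for (std::size_t i = 0; i < n; ++i) {
        // Nearest table entry; the mask wraps negative and large phases.
        const long idx = std::lround(phases[i] * scale);
        const std::size_t s = static_cast<std::size_t>(idx) & (kTableSize - 1);
        // cos(x) = sin(x + pi/2), i.e. a quarter-table offset.
        const std::size_t c = (s + kTableSize / 4) & (kTableSize - 1);
        out[i].real = amplitudes[i] * table[c];
        out[i].imag = amplitudes[i] * table[s];
    }
}
```

A larger table or linear interpolation between neighbouring entries would trade memory or a few extra operations for accuracy.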

Regarding intrinsics versus assembler for NEON: I have tried both, and in most cases they give practically the same result with a modern compiler. But assembler is more labor-intensive. The main performance improvement comes from handling data correctly (loading, storing) and from using vector instructions, and both of these can be done with intrinsics.
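As a rough illustration of the load/store part, the sketch below multiplies four amplitudes at a time and writes the results as interleaved real/imag pairs with a single `vst2q_f32`. It assumes the cosine and sine values were already computed into separate arrays, and that the complex struct is exactly two packed floats. `Complex` and `scaleAndInterleave` are hypothetical names, and the code falls back to a scalar loop when NEON is not available.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical stand-in for the question's complex struct.
struct Complex { float real; float imag; };
static_assert(sizeof(Complex) == 2 * sizeof(float), "assumes packed floats");

// Scales precomputed cos/sin values by the amplitudes and stores
// interleaved (real, imag) pairs.
void scaleAndInterleave(const float* amplitudes, const float* cosines,
                        const float* sines, Complex* out, std::size_t n) {
    assert(amplitudes && cosines && sines && out);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        const float32x4_t a = vld1q_f32(amplitudes + i);   // load 4 amplitudes
        float32x4x2_t z;
        z.val[0] = vmulq_f32(a, vld1q_f32(cosines + i));   // 4 real parts
        z.val[1] = vmulq_f32(a, vld1q_f32(sines + i));     // 4 imaginary parts
        // Interleaving store: re0, im0, re1, im1, ...
        vst2q_f32(reinterpret_cast<float*>(out + i), z);
    }
#endif
    // Scalar tail, and the whole loop on non-NEON targets.
    for (; i < n; ++i) {
        out[i].real = amplitudes[i] * cosines[i];
        out[i].imag = amplitudes[i] * sines[i];
    }
}
```

The compiler handles register allocation and scheduling here, which is the part that handwritten assembler would otherwise have to get right by hand.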

Of course, if you want to achieve 100% CPU utilization, you sometimes need to use assembler. But that is a rare case.

ErmIg