Realtime audio application, improving performance

Question

I am currently writing a C++ real time audio application which roughly contains:

reading frames from a buffer
interpolating frames with Hermite interpolation
filtering every frame with two biquad filters (and updating their coefficients every frame)
a 3 band crossover containing 18 biquad calculations
the FreeVerb algorithm from the STK library

I think my PC should be able to handle this, but I get buffer underflows every so often, so I would like to improve the performance of my application. I have a bunch of questions I hope you can answer. :)

1) Operator Overloading

Instead of working directly with my float samples and doing calculations for every sample, I pack my floats into a Frame class which contains the left and the right sample. The class overloads some operators for addition, subtraction and multiplication with a float.

The filters (mostly biquads) and the reverb work with floats and don't use this class, but the Hermite interpolator and every multiplication and addition for volume control and mixing use it.
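For illustration, a minimal sketch of what such a Frame class might look like. The member names leftSample and rightSample match the code further down; the operator set and the mix() helper are assumptions made up for this example, not the actual class.

```cpp
#include <cassert>

// Hypothetical stand-in for the Frame class described above.
struct Frame {
    float leftSample  = 0.0f;
    float rightSample = 0.0f;

    Frame() = default;
    Frame(float l, float r) : leftSample(l), rightSample(r) {}
};

// Small operators defined inline so the compiler can see their bodies
// at every call site.
inline Frame operator+(Frame a, Frame b) {
    return Frame(a.leftSample + b.leftSample, a.rightSample + b.rightSample);
}

inline Frame operator-(Frame a, Frame b) {
    return Frame(a.leftSample - b.leftSample, a.rightSample - b.rightSample);
}

inline Frame operator*(float gain, Frame f) {
    return Frame(gain * f.leftSample, gain * f.rightSample);
}

inline Frame operator*(Frame f, float gain) {
    return gain * f;
}

// Illustrative helper (not in the original code): cross-fades two frames
// and applies a master volume, like the MIX loop in processAudio().
inline Frame mix(Frame a, Frame b, float balance, float volume) {
    assert(balance >= 0.0f && balance <= 1.0f);
    return volume * (balance * a + (1.0f - balance) * b);
}
```

With balance = 0.5f and volume = 2.0f, mix(Frame(1, 2), Frame(3, 4), ...) yields a frame with leftSample 4 and rightSample 6.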

Does this have an impact on performance, and would it be better to work with the left and right samples directly?

2) std::function

The callback function from the audio I/O library PortAudio calls a std::function. I use this to encapsulate everything related to PortAudio, so the "user" sets their own callback function with std::bind:

std::bind(  &AudioController::processAudio, 
            &(*this), 
            std::placeholders::_1, 
            std::placeholders::_2));

Since the CPU has to find the right function for every callback (however this works...), does this have an impact, and would it be better to define a class the user has to inherit from?

3) virtual functions

I use a class called AudioProcessor which declares a virtual function:

virtual void tick(Frame *buffer, int frameCount) = 0;

This function always processes a number of frames at once: depending on the driver, 200 to 1000 frames per call. Within the signal processing path, I call this function 6 times on multiple derived classes. I remember that this is done with lookup tables so the CPU knows exactly which function it has to call. So does calling a virtual function (overridden in a derived class) have an impact on performance?

The nice thing about this is the structure it gives the source code, but using only inline functions might improve performance.

These are all questions for now. I have some more about Qt's event loop because I think that my GUI uses quite a bit of CPU time as well. But this is another topic I guess. :)

Thanks in advance!

These are all the relevant function calls within the signal processing. Some of them are from the STK library. The biquad functions are from STK and should perform fine. The same goes for the FreeVerb algorithm.

// ################################ AudioController Function ############################
void AudioController::processAudio(int frameCount, float *output) {
    // CALCULATE LEFT TRACK

    Frame * leftFrameBuffer = (Frame*) output;

    if(leftLoaded) { // the left processor is loaded
        leftProcessor->tick(leftFrameBuffer, frameCount);   // calls TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            leftFrameBuffer[i].leftSample  = 0.0f;
            leftFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // CALCULATE RIGHT TRACK

    if(rightLoaded) { // the right processor is loaded
        // the rightFrameBuffer is allocated once and ensured to have enough space for frameCount Frames
        rightProcessor->tick(rightFrameBuffer, frameCount); // calls TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample  = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // MIX
    for(int i = 0; i < frameCount; i++ ) {
        leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0f - leftRightMix) * rightFrameBuffer[i]);
    }
}

// ################################ TrackProcessor Function ############################

void TrackProcessor::tick(Frame *frames, int frameNum) {
    if(bufferLoaded && playback) {
        for(int i = 0; i < frameNum; i++) {
            // read from buffer
            frames[i] =  bufferPlayer->tick();

            // filter coeffs
            caltulateFilterCoeffs(lowCutoffFilter->tick(), highCutoffFilter->tick());

            // filter
            frames[i].leftSample = lpFilterL->tick(hpFilterL->tick(frames[i].leftSample));
            frames[i].rightSample = lpFilterR->tick(hpFilterR->tick(frames[i].rightSample));
        }
    } else {
        for(int i = 0; i < frameNum; i++) {         
            frames[i] = Frame(0,0);
        }
    }

    // Effect 1, Equalizer
    if(effsActive[0]) {
        insEffProcessors[0]->tick(frames, frameNum);
    }
    // Effect 2, Reverb
    if(effsActive[1]) {
        insEffProcessors[1]->tick(frames, frameNum);
    }

    // Volume
    for(int i = 0; i < frameNum; i++) {
        frames[i].leftSample  *= volume;
        frames[i].rightSample *= volume;
    }
}

// ################################ Equalizer ############################

void EqualizerProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        Frame lowCross;
        Frame highCross;

        for(int f = 0; f < frameNum; f++) {

            lowAmp = lowAmpFilter->tick();
            midAmp = midAmpFilter->tick();
            highAmp = highAmpFilter->tick();

            lowCross =  highLPF->tick(frames[f]);
            highCross = highHPF->tick(frames[f]);

            frames[f] = lowAmp * lowLPF->tick(lowCross) 
                      + midAmp * lowHPF->tick(lowCross) 
                      + highAmp * lowAPF->tick(highCross);
        }
    }
}

// ################################ Reverb ############################
// This function just calls the stk::FreeVerb tick function for every frame
// The FreeVerb implementation can't really be optimised so I will take it as it is.

void ReverbProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        for(int i = 0; i < frameNum; i++) {
            frames[i].leftSample = reverb->tick(frames[i].leftSample, frames[i].rightSample);
            frames[i].rightSample = reverb->lastOut(1);
        }
    }
}

// ################################ Buffer Playback (BufferPlayer) ############################

Frame BufferPlayer::tick() {
    // adjust read position based on loop status
    if(inLoop) {
        while(readPos > loopEndPos) {
            readPos = loopStartPos + (readPos - loopEndPos); 
        }
    }

    int x1  = readPos;
    float t = readPos - x1;

    Frame f = interpolate(buffer->frameAt(x1-1), 
                          buffer->frameAt(x1),
                          buffer->frameAt(x1+1),
                          buffer->frameAt(x1+2),
                          t);

    readPos += stepSize;
    return f;
}

// interpolation:
Frame BufferPlayer::interpolate(Frame x0, Frame x1, Frame x2, Frame x3, float t) {
    Frame c0 = x1;
    Frame c1 = 0.5f * (x2 - x0);
    Frame c2 = x0 - (2.5f * x1) + (2.0f * x2) - (0.5f * x3);
    Frame c3 = (0.5f * (x3 - x0)) + (1.5f * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}


inline Frame BufferPlayer::frameAt(int pos) {
    if(pos < 0) {
        pos = 0;
    } else if (pos >= frames) {
        pos = frames -1;
    }

    // get chunk and relative Sample
    int chunk = pos/ChunkSize;
    int chunkSample = pos%ChunkSize;

    return Frame(leftChunks[chunk][chunkSample], rightChunks[chunk][chunkSample]); 
}

These language constructs won't be a bottleneck. It is more likely your audio processing is slow or you are doing something like performing dynamic allocations (`new Frame`) far more often than you need to. — Radiodef, Nov 09 '14 at 19:48
I don't allocate anything in the callback, I use existing buffers. I think I will add all the relevant functions so you can check the processing. ;) — ruhig brauner, Nov 09 '14 at 19:57
You should always profile before optimizing. This will help isolate the region(s) of your code that need performance optimizing. — Thomas Matthews, Nov 09 '14 at 20:08
Are there sections of the code that can be performed in parallel? For example, one thread works on the left sample while another thread works on the right sample. Also, see if you can employ the processing power of the Graphics Processing Unit. — Thomas Matthews, Nov 09 '14 at 20:11
Yes, the equalizer processing could be split into left and right but I don't know if it's worth allocating a new thread. (However this would work...) — ruhig brauner, Nov 09 '14 at 20:20

score 4 · Answer 1 · answered Nov 09 '14 at 20:38

Some suggestions on performance improvement:

Optimize Data Cache Usage

Review your functions that operate on a lot of data (e.g. arrays). The functions should load data into cache, operate on the data, then store back into memory.

The data should be organized to best fit into the data cache. Break up the data into smaller blocks if it doesn't fit. Search the web for "data-oriented design" and "cache optimizations".

In one project, performing data smoothing, I changed the layout of data and gained 70% performance.
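As a rough sketch of the "smaller blocks" idea applied to audio: split the interleaved stereo buffer into small contiguous per-channel blocks that fit comfortably in cache, run each filter over a whole block, then write the result back. The Biquad struct, the function filterInterleaved and the block size of 64 are assumptions made for this example; they are not taken from STK or from the code in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical biquad (transposed direct form II), for illustration only.
struct Biquad {
    float b0 = 1.0f, b1 = 0.0f, b2 = 0.0f, a1 = 0.0f, a2 = 0.0f;
    float z1 = 0.0f, z2 = 0.0f;  // filter state

    // Filters a contiguous run of samples in place.
    void processBlock(float* samples, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            const float in  = samples[i];
            const float out = b0 * in + z1;
            z1 = b1 * in - a1 * out + z2;
            z2 = b2 * in - a2 * out;
            samples[i] = out;
        }
    }
};

// Arbitrary block size chosen for the sketch; tune by measuring.
constexpr std::size_t kBlockSize = 64;

// Processes an interleaved L/R buffer block by block using small
// per-channel scratch arrays on the stack (no allocation).
void filterInterleaved(float* interleaved, std::size_t frameCount,
                       Biquad& left, Biquad& right) {
    assert(interleaved != nullptr || frameCount == 0);
    float l[kBlockSize];
    float r[kBlockSize];

    for (std::size_t start = 0; start < frameCount; start += kBlockSize) {
        const std::size_t n = std::min(kBlockSize, frameCount - start);
        float* block = interleaved + 2 * start;

        // Deinterleave into contiguous per-channel data.
        for (std::size_t i = 0; i < n; ++i) {
            l[i] = block[2 * i];
            r[i] = block[2 * i + 1];
        }

        left.processBlock(l, n);
        right.processBlock(r, n);

        // Interleave the results back into the output buffer.
        for (std::size_t i = 0; i < n; ++i) {
            block[2 * i]     = l[i];
            block[2 * i + 1] = r[i];
        }
    }
}
```

Whether this layout is actually faster than per-frame processing depends on the filter chain and the hardware, so it is worth profiling both variants.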

Use Multiple Threads

In the big picture, you may be able to use at least three dedicated threads: input, processing and output. The input thread obtains the data and stores it in buffer(s); search the Web for "double buffering". The second thread gets data from the input buffer, processes it, then writes to an output buffer. The third thread writes data from the output buffer to the file.

You may also benefit from using threads for left and right samples. For example, while one thread is processing the left sample, another thread could be processing the right sample. If you could put the threads on different cores, you may see even more performance benefit.
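A minimal sketch of the two-thread idea using std::thread, assuming two independent tracks that are mixed afterwards. The functions processTrack and processAndMix are hypothetical placeholders for the real per-track processing. Note that this sketch creates a thread on every call, which is itself costly; a real-time callback would more likely hand work to a worker thread that already exists.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-track work (here just a gain).
void processTrack(std::vector<float>& track, float gain) {
    for (float& sample : track) {
        sample *= gain;
    }
}

// Processes the left track on a worker thread while the calling thread
// handles the right track, then mixes both into a pre-sized output.
void processAndMix(std::vector<float>& left, std::vector<float>& right,
                   float leftGain, float rightGain, std::vector<float>& out) {
    assert(left.size() == right.size() && out.size() == left.size());

    std::thread worker(processTrack, std::ref(left), leftGain);
    processTrack(right, rightGain);  // runs concurrently with the worker
    worker.join();                   // both tracks are finished after this

    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = left[i] + right[i];
    }
}
```

The two tracks share no data until join() returns, so no locking is needed in this arrangement.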

Use GPU Processing

Many modern graphics processing units (GPUs) have many cores that can process floating-point values. Maybe you could delegate some of the filtering or analysis functions to the cores in the GPU. Be aware that moving data to and from the GPU adds overhead; to gain a benefit, the computation has to cost more than that overhead.

Reduce Branching

Processors prefer to manipulate data over branching. Branching can stall execution because the processor has to figure out where to fetch the next instruction from. Some processors have large instruction caches that can contain small loops, but there is still a penalty for branching back to the top of the loop. See "Loop Unrolling". Also check your compiler's optimization settings and choose a high optimization level for speed. Many compilers will unroll loops for you if the circumstances are right.
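To make this concrete, here is a sketch of two common rewrites on a simple gain loop: moving a loop-invariant branch out of the loop, and unrolling by four. The function names are made up for this example. An optimizing compiler may already perform both transformations on its own, so measure before keeping a hand-written version.

```cpp
#include <cassert>
#include <cstddef>

// Baseline: the mute flag is tested once per sample.
void applyGainBranchy(float* samples, std::size_t count, float gain, bool muted) {
    for (std::size_t i = 0; i < count; ++i) {
        if (muted) {
            samples[i] = 0.0f;
        } else {
            samples[i] *= gain;
        }
    }
}

// Hoisted: the decision is made once, the loop body is branch-free.
// Assumes finite samples (multiplying NaN or infinity by 0 is not 0).
void applyGainHoisted(float* samples, std::size_t count, float gain, bool muted) {
    assert(samples != nullptr || count == 0);
    const float effectiveGain = muted ? 0.0f : gain;
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] *= effectiveGain;
    }
}

// Manually unrolled by four, with a remainder loop for the tail.
void applyGainUnrolled(float* samples, std::size_t count, float gain) {
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        samples[i]     *= gain;
        samples[i + 1] *= gain;
        samples[i + 2] *= gain;
        samples[i + 3] *= gain;
    }
    for (; i < count; ++i) {
        samples[i] *= gain;
    }
}
```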

Reduce the Amount of Processing

Do you need to process the entire sample or only portions of it? For example, in video processing, much of the frame doesn't change, only small portions do, so the entire frame doesn't need to be processed. Can the audio channels be isolated so only a few channels are processed rather than the entire spectrum?

Coding to Help the Compiler Optimize

You can help the compiler with optimizations by using the const modifier. The compiler may be able to use different algorithms for variables that don't change versus ones that do. For example, a const value can be placed in the executable code, but a non-const value must be placed in memory.

Using static and const can help too. The static usually implies only one instance. The const implies something that doesn't change. So if there is only one instance of the variable that doesn't change, the compiler can place it into the executable or read-only memory and perform a higher optimization of the code.
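A small sketch of what this can look like in practice, assuming a fixed three-tap smoothing kernel. The kernel values, the name kKernel and the function smooth are illustrative assumptions. The table is a compile-time constant with internal linkage, the input is read through a pointer to const, and the intermediate values are const locals.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

namespace {
// Single, immutable instance; the compiler is free to fold these values
// into the generated code or place them in read-only data.
constexpr std::array<float, 3> kKernel{0.25f, 0.5f, 0.25f};
}

// Smooths `input` into `output` (both `count` samples long); the edge
// samples are repeated at the boundaries.
void smooth(const float* input, float* output, std::size_t count) {
    assert(input != output);  // in-place use is not supported by this sketch
    for (std::size_t i = 0; i < count; ++i) {
        const float prev = input[i > 0 ? i - 1 : 0];
        const float next = input[i + 1 < count ? i + 1 : count - 1];
        output[i] = kKernel[0] * prev + kKernel[1] * input[i] + kKernel[2] * next;
    }
}
```

For the input {0, 4, 8, 4, 0} this produces {1, 4, 6, 4, 1}.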

Loading multiple variables at the same time can help too. The processor can place the data into the cache. The compiler may be able to use specialized assembly instructions for fetching sequential data.

Hi, thanks for the response. I have some questions: The whole processing is separated into two parts, so processing these in different threads would really make a difference. But how should I implement this? (A framework? The C++ standard library or something else?) I had never heard about optimizing cache usage, but I will read up on the topic. :) Thanks for the tips! — ruhig brauner, Nov 09 '14 at 20:48
Please clarify "how should I implement this"? Are you asking how to implement threads? Threads are usually a platform specific issue, but take a look at `Boost::threads` or search for "Boost library threading" — Thomas Matthews, Nov 09 '14 at 20:50
Yes, that was the question. :) I haven't really worked with threads except in some courses, and in most cases we used threads there to learn how they work, not to get (for example) a performance improvement out of them. ;D — ruhig brauner, Nov 09 '14 at 20:53
I've just boosted the performance to 150% by letting a separate thread process the left track. std::thread did fine. :D — ruhig brauner, Nov 09 '14 at 21:12
With your kind of application, it is tricky to guess where the performance bottleneck is. The bottleneck could be file I/O (as is usual with I/O-bound programs) or it could be the data analysis (common with data-intensive programs). I suggest you review your input and output functionality. The objective, if possible, is to not have the main program wait for I/O or have the data supplier wait (as with hard drives). — Thomas Matthews, Nov 09 '14 at 21:17
Up voted for a good list of low hanging fruit for optimization. I would probably add Vectorization and Unrolling of small loops. — BlamKiwi, Nov 09 '14 at 22:27
For realtime audio, GPU latency is typically an issue. You will get the most gains from making sure your effect implementations use SIMD instructions! Consider using these for your mixing as well. — yano, Feb 10 '17 at 00:29