
I need to decode audio data as fast as possible using the Opus decoder.

Currently my application is not fast enough. The decoding is as fast as it can get, but I need to gain more speed.

I need to decode about 100 sections of audio. These sections are not consecutive (they are not related to each other).

I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings is completed. In my dreams I could start everything in parallel. I have not used multithreading before.

I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.

Thank you.

tmighty
  • If your desired operations can be done without depending on each other, they might be good candidates to be run in parallel. – Mark Garcia Jul 01 '13 at 04:58
  • @MarkGarcia Can you tell me how this should be done? Does the decoder have to be a separate exe, or can I implement this in my main application? – tmighty Jul 01 '13 at 05:26
  • you could check [task-based parallelism](http://msdn.microsoft.com/en-us/library/dd492427.aspx) or [parallel::for_each](http://msdn.microsoft.com/en-us/library/dd492418.aspx), the same is possible with [tbb](http://threadingbuildingblocks.org/) as well – Dmitry Ledentsov Jul 01 '13 at 05:40
  • Are these 100 sections of audio on disk? How much of the time is taken just to read/write disk? Disks can't be read in parallel, so multithreading will not help with the disk-related portion of the work. – ScottMcP-MVP Jul 01 '13 at 08:00
  • @ScottMcP-MVP Yes, the audio is on the disk, but I am not talking about the time required to read the data from the disk. I was really talking only about the time needed to decode the data. – tmighty Jul 01 '13 at 09:04
  • Are you using C++11? Windows/POSIX/both? – japreiss Jul 03 '13 at 15:08
  • What format is the audio in, what format do you want to convert it to? – Neil Jul 03 '13 at 15:16
  • Right now your question is overly vague. I think the best thing is to elaborate a bit more on the data format and architecture, and what exactly it is you see/hope "in your dreams". If it weren't for the massive bounty, this question would probably have a few close flags by now for not being specific enough. Funny how that works, ey :P – Preet Kukreti Jul 03 '13 at 15:16
  • Make sure you are using non-blocking, multiplexed I/O with edge-triggered polling before you think about multithreading. – Kerrek SB Jul 05 '13 at 21:11

6 Answers


You would break your work up by task. Let's assume your process is in fact CPU bound (you indicate it is but… it's not usually that simple).

Right now, you decode 100 sections:

I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings is completed. In my dreams I could start everything in parallel.

Actually, you should use a number close to the number of cores on the machine.

Assuming a modern desktop (e.g. 2-8 cores), running 100 threads at once will just slow it down; the kernel will waste a lot of time switching from one thread to another, and the process is also likely to hit higher peak resource usage while the threads contend for the same hardware.

So just create a task pool which restricts the number of active tasks to the number of cores. Each task would (generally) represent the decoding work to perform for one input file (section). This way, the decoding process is not actually sharing data across multiple threads (allowing you to avoid locking and other resource contention).

When complete, go back and fine-tune the number of concurrent tasks in the pool (e.g. using the exact same inputs and a stopwatch on multiple machines). The fastest setting may be lower or higher than the number of cores (most likely because of disk I/O). It also helps to profile.

I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.

Yes, if the problem is CPU bound, then that is generally fine. This also assumes your decoder/dependent software is capable of running with multiple threads.

The problem you will run into, if these are files on disk, is that you will probably need to optimize how you read (and write?) the files from many cores. Allowing 8 jobs to run at once can make your problem disk bound -- and 8 simultaneous readers/writers is a bad way to use hard disks, so you may find it is not as fast as you expected. You may therefore need to optimize I/O for your concurrent decode implementation. Larger buffer sizes help in this regard, but they come at a cost in memory.

justin
  • good answer, though usually the # of threads that you should have is (# cores on the machine)*2 – Adrian Jul 03 '13 at 16:00
  • `threads = n*2`? I've usually used `threads = n` for `n <= 2`, `threads = n+1` for `2 < n <= 10`, `threads = n * 1.2` for `n > 10`; – antiduh Jul 03 '13 at 17:21
  • see http://stackoverflow.com/a/13958877/1007845 (best if you do some tests on your CPU and the tasks; some handle 8 threads, some 40 per core) – Adrian Jul 03 '13 at 17:28
  • I measured threads = n+1 as the fastest (in Java). – AlexWien Jul 03 '13 at 17:34
  • @antiduh I upvoted Adrian's comment. There are many variables, notably complexity to decode (encoding), compression ratios of inputs, disk I/O, memory speeds, targeted architecture (e.g. has hyperthreading?), and pool implementation. I have an implementation which, when run on an SSD, can handle NCORES*32 and complete within about 2% of the fastest run (that's mostly PCM format conversion). Of course, it's wasteful to use NCORES*32, and that difference is much larger when run off a hard disk. It "depends on a lot of variables", but NCores*2 or NCores/2 may be fastest for a particular task. – justin Jul 03 '13 at 17:43
  • @justin - I agree with you, there's a lot of variables. I meant it more for purely CPU-bound code. For this specific situation, I'd sooner recommend a pipeline where 1-2 threads read files in, N*fudge threads process buffers, and 1-2 threads write files out. It cuts down on the number of total threads and usually plays nicer with the disk - 10 threads trying to write 10 files at the same time can make HDDs seek everywhere, wasting time, instead of writing one or two files at a time. – antiduh Jul 03 '13 at 17:55
  • @antiduh Yes, I started with NCORES under the CPU-bound assumption, and just decided to leave it because I/O often becomes an issue before NCORES*2 is attained. Telling a hard disk to be in many places at once is a possible example of NCORES/2 completing faster than NCORES. Without knowing the OP's encoding or more about these variables, I can't really make a good recommendation -- so I just recommended he/she start with NCORES and measure instead. – justin Jul 03 '13 at 18:01

This answer is probably going to need a little refinement from the community, since it's been a long while since I worked in this environment, but here's a start -

Since you're new to multi-threading in C++, start with a simple project to create a bunch of pthreads doing a simple task.

Here's a quick and small example of creating pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* ThreadStart(void* arg);

int main( int count, char** argv) {
        pthread_t thread1, thread2;

        int* threadArg1 = (int*)malloc(sizeof(int));
        int* threadArg2 = (int*)malloc(sizeof(int));

        *threadArg1 = 1;
        *threadArg2 = 2;

        pthread_create(&thread1, NULL, &ThreadStart, (void*)threadArg1 );
        pthread_create(&thread2, NULL, &ThreadStart, (void*)threadArg2 );

        pthread_join(thread1, NULL);
        pthread_join(thread2, NULL);
        free(threadArg1);
        free(threadArg2);
}

void* ThreadStart(void* arg) {

        int threadNum = *((int*)arg);
        printf("hello world from thread %d\n", threadNum);

        return NULL;
}

Next, you're going to be using multiple Opus decoders. Opus appears to be thread safe, so long as you create a separate OpusDecoder object for each thread.

To feed jobs to your threads, you'll need a list of pending work units that can be accessed in a thread safe manner. You can use std::vector or std::queue, but you'll have to use locks around it when adding to it and when removing from it, and you'll want to use a counting semaphore so that the threads will block, but stay alive, while you slowly add workunits to the queue (say, buffers of files read from disk).

Here's some example code, similar to the above, that shows how to use a shared queue and how to make the threads wait while you fill the queue:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <queue>
#include <semaphore.h>
#include <unistd.h>

void* ThreadStart(void* arg);

static std::queue<int> workunits;
static pthread_mutex_t workunitLock;
static sem_t workunitCount;

int main( int count, char** argv) {
    pthread_t thread1, thread2;

    pthread_mutex_init(&workunitLock, NULL);
    sem_init(&workunitCount, 0, 0);

    pthread_create(&thread1, NULL, &ThreadStart, NULL);
    pthread_create(&thread2, NULL, &ThreadStart, NULL);

    // Make a bunch of workunits while the threads are running.
    for (int i = 0; i < 200; i++ ){
        pthread_mutex_lock(&workunitLock);

        workunits.push(i);
        sem_post(&workunitCount);

        pthread_mutex_unlock(&workunitLock);

        // Pretend that it takes some effort to create work units;
        // this shows that the threads really do block patiently
        // while we generate workunits.
        usleep(5000);
    }

    // Sometime in the next while, the threads will be blocked on
    // sem_wait because they're waiting for more workunits. None
    // of them are quitting because they never saw an empty queue.
    // Pump the semaphore once for each thread so they can wake
    // up, see the empty queue, and return.

    sem_post(&workunitCount);
    sem_post(&workunitCount);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    pthread_mutex_destroy(&workunitLock);
    sem_destroy(&workunitCount);

}

void* ThreadStart(void* arg) {

    int workUnit;
    bool haveUnit;

    do{
        sem_wait(&workunitCount);

        pthread_mutex_lock(&workunitLock);

        // Figure out if there's a unit, grab it under
        // the lock, then release the lock as soon as we can.
        // After we release the lock, then we can 'process'
        // the unit without blocking everybody else.
        haveUnit = !workunits.empty();

        if ( haveUnit ) {
            workUnit = workunits.front();
            workunits.pop();
        }
        pthread_mutex_unlock(&workunitLock);

        // Now that we're not under the lock, we can spend
        // as much time as we want processing the workunit.
        if ( haveUnit ) {
            printf("Got workunit %d\n", workUnit);
        }
    } 
    while(haveUnit);

    return NULL;
}
antiduh

Instead of making your own threads and managing them, I suggest you use a thread pool and hand your decoding tasks to the pool. The pool will assign tasks to as many threads as it and the system can handle. There are different types of thread pools, so you can set parameters such as forcing a specific number of threads or allowing the pool to keep increasing the number of threads.

One thing to keep in mind is that more threads doesn't mean they execute in parallel. I think the correct term is concurrently, unless you have a guarantee that each thread runs on a different CPU (which would give true parallelism).

Your entire pool can come to a halt if blocked for IO.

Adrian

Before jumping into multithreading as a solution to speed things up, study the concepts of oversubscribing and undersubscribing.

If the processing of the audio involves long blocking IO calls, then multithreading is worth it.

Anand Rathi
  • You seem to have a good point, but could you please not be so cryptic? – tmighty Jul 01 '13 at 08:50
  • I didn't mean to be cryptic; let me know if we need to expand on any particular point? – Anand Rathi Jul 01 '13 at 08:59
  • You mean that if the lack of speed is due to disk access (=IO), multithreading is not working (because a disk can not be accessed multiple times at the same time), but if I am really talking about the decoding done by the CPU, it may be worth it? – tmighty Jul 01 '13 at 09:06
  • @tmighty you are correct on the first part of your comment. I don't really understand your second part. Yes, only the CPU does decoding anyway. You probably meant to say that if the data is in memory as opposed to disk. – Adrian Jul 03 '13 at 17:24
  • @Adrian He meant: if you access the disk, it takes so long that your CPU has nothing to do. In the meantime, another thread could decode audio. – AlexWien Jul 03 '13 at 17:37
  • Thanks @AlexWien, exactly: if disk access is taking time, you have CPU cycles available for another thread; otherwise your CPU will be wasting time scheduling threads instead of decoding audio. – Anand Rathi Jul 04 '13 at 10:25

Although the vagueness of your question doesn't really help... how about:

Create a list of audio files to convert.

While there is a free processor, 
   launch the decoder application with the next file in the queue.   
Repeat until there is nothing else in the list

If, during testing, you discover the processors aren't always 100% busy, launch 2 decodes per processor.

It could be done quite easily with a bit of bash/tcl/python.

Neil

You can use threads in general, but locking has some issues. I will base the answer on POSIX threads and locks, but this is fairly general and you will be able to port the idea to any platform. If your jobs require any kind of locking, you may find the following useful. Also, it is best to keep reusing the same threads again and again, because thread creation is costly (see thread pooling).

Locking is a bad idea in general for "realtime" audio since it adds latency, but for non-realtime decoding/encoding jobs locks are perfectly OK; even for realtime ones you can get better performance and avoid dropped frames by applying some threading knowledge.

For audio, semaphores are a bad, bad idea. They were too slow (at least POSIX semaphores on my system) when I tried them, but you will need them if you are thinking of cross-thread locking (the kind where you lock in one thread and unlock in another). POSIX mutexes only allow self-lock and self-unlock (you have to do both in the same thread); otherwise the program might work, but it's undefined behavior and should be avoided.

Lock-free atomic operations might give you enough freedom from locks to implement some of the same functionality with better performance.