Audio signal source separation with neural network

Question

What I am trying to do is separate the audio sources in a raw signal and extract the pitch of each one. I modeled this process myself, as represented below:

(Figure: model for decomposing the raw signal.)

Each source oscillates in normal modes, which often places its component peaks at integer multiples of a fundamental frequency. These are known as harmonics. Each source is then shaped by resonance, and finally the sources are combined linearly.

As seen above, I have many hints from the frequency response patterns of audio signals, but almost no idea how to 'separate' them. I've tried countless models of my own. This is one of them:

Apply an FFT to the PCM data.
Get the peak frequency bins and their amplitudes.
Calculate pitch candidate frequency bins.
For each pitch candidate, use a recurrent neural network to analyze all the peaks and find the appropriate combination of peaks.
Separate the analyzed pitch candidates.
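
To make the first three steps of the list above concrete, here is a minimal sketch using NumPy. The function names (`spectral_peaks`, `pitch_candidates`), the Hann window, and the parameters `max_peaks`, `max_harmonic`, and `min_hz` are assumptions chosen for illustration, not the exact code I ran. The recurrent-network step is not shown.

```python
import numpy as np

def spectral_peaks(pcm, sample_rate, max_peaks=20):
    """Return (frequencies_hz, magnitudes) of the strongest spectral local maxima."""
    pcm = np.asarray(pcm, dtype=float)
    # A Hann window reduces leakage before taking the FFT of the frame.
    spectrum = np.abs(np.fft.rfft(pcm * np.hanning(len(pcm))))
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / sample_rate)
    # A bin is a peak if it is larger than its left neighbour
    # and at least as large as its right neighbour.
    interior = spectrum[1:-1]
    is_peak = (interior > spectrum[:-2]) & (interior >= spectrum[2:])
    peak_bins = np.flatnonzero(is_peak) + 1
    # Keep only the strongest peaks, then restore ascending frequency order.
    order = np.argsort(spectrum[peak_bins])[::-1][:max_peaks]
    strongest = np.sort(peak_bins[order])
    return freqs[strongest], spectrum[strongest]

def pitch_candidates(peak_freqs, max_harmonic=5, min_hz=50.0):
    """Treat each peak as harmonic h of some fundamental and propose peak / h."""
    candidates = set()
    for f in peak_freqs:
        for h in range(1, max_harmonic + 1):
            f0 = f / h
            if f0 >= min_hz:
                candidates.add(round(float(f0), 1))
    return sorted(candidates)
```

For a mix of two harmonic tones, the true fundamentals show up among the candidates, but so do many spurious sub-harmonics. That ambiguity is exactly what the later steps have to resolve.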

Unfortunately, none of them has successfully separated the signal so far. I would appreciate any advice on solving this kind of problem, especially on modeling source separation like the model above.

I would advise you to ask somewhere else - this question is probably far too specialised for here, and isn't really a software development question per se. — marko, Feb 02 '14 at 11:02
My understanding is that the combined pitch and amplitudes can result from infinitely many different source signals, hence the impossibility of finding the correct origins. I would recommend applying automatic feature extraction (Sparse Auto-encoders) on MFCC to obtain rather subtle features unique to each source combination. I am eager to test this and I might provide you with a plausible solution if you could provide a link hosting such a dataset. Thanks! — IssamLaradji, Feb 12 '14 at 11:18
Dear @Memming, I have of course heard of ICA, but AFAIK it requires N monitors to separate N sources. It's not suitable for my case, since audio files normally have fewer than 3 channels. — Laie, Feb 24 '14 at 07:31
Dear @IssamLaradji, firstly, thank you for your advice, especially about Sparse Auto-encoders. I am currently reviewing how to apply that technique to my problem. I'll gladly share my research with you. For such an audio dataset, this site has great samples: http://theremin.music.uiowa.edu/MISflute.html — Laie, Feb 24 '14 at 07:40
After reading this, I ran into the following lecture https://youtu.be/LuBer-0WmpQ linked from http://www.saneworkshop.org/ (2015 version) — Eponymous, Jun 21 '16 at 17:41

score 5 · Answer 1 · edited May 23 '17 at 12:02

Because no one has really attempted to answer this, and because you've marked it with the neural-network tag, I'm going to address the suitability of a neural network to this kind of problem. As the question was somewhat non-technical, this answer will also be "high level".

Neural networks require some sort of sample set from which to learn. In order to "teach" a neural net to solve this problem you would essentially need to have a working set of known solutions to work from. Do you have this? If so, read on. If not, a neural network is probably not what you are seeking. You stated that you have "many hints" but no real solution. This leads me to believe you probably don't have sample sets. If you can get them, great; otherwise you might be out of luck.

Supposing now that you have a sample set of Raw Signal samples and corresponding Source 1 and Source 2 outputs... Well, now you're going to need a method for deciding on a topology. Assuming you don't know a lot about how neural nets work (and don't want to), and assuming you also don't know the exact degree of complexity of the problem, I would probably recommend the open source NEAT package to get you started. I am not affiliated in any way with this project, but I have used it, and it allows you to (relatively) intelligently evolve neural network topologies to fit the problem.

Now, in terms of how a neural net would solve this specific problem. The first thing that comes to mind is that all audio signals are essentially time-series. That is to say, the information they convey is somehow dependent and related to the data at previous timesteps (e.g. the detection of some waveform cannot be done from a single time-point; it requires information about previous timesteps as well). Again, there's a million ways of solving this problem, but since I'm already recommending NEAT I'd probably suggest you take a look at the C++ NEAT Time Series mod.

If you're going down this route, you'll probably be wanting to use some sort of sliding window to provide information about the recent past at each time step. For a quick and dirty intro to sliding windows, check out this SO question:

Time Series Prediction via Neural Networks

The size of the sliding window can be important, especially if you're not using recurrent neural nets. Recurrent connections allow neural nets to remember previous time steps (at the cost of performance - NEAT is already recurrent so that choice is made for you here). You will probably want the sliding window length (i.e. the number of timesteps in the past provided at every time step) to be roughly equal to your conservative guess of the largest number of previous timesteps required to gain enough information to split your waveform.
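
As a rough illustration of the sliding-window idea, the sketch below turns a 1-D signal into one input row per time step, each row holding the current sample plus its recent past. The helper name `sliding_windows` and the parameter `window_length` are made up for this example and are not part of NEAT or any other package.

```python
import numpy as np

def sliding_windows(signal, window_length):
    """Return a 2-D array whose row i is signal[i : i + window_length].

    Row i is the network input for time step i + window_length - 1,
    i.e. the current sample preceded by window_length - 1 past samples.
    """
    signal = np.asarray(signal, dtype=float)
    if window_length < 1 or window_length > len(signal):
        raise ValueError("window_length must be between 1 and len(signal)")
    n_rows = len(signal) - window_length + 1
    return np.stack([signal[i:i + window_length] for i in range(n_rows)])
```

With a window of 3, the signal `[0, 1, 2, 3, 4]` becomes the rows `[0, 1, 2]`, `[1, 2, 3]`, `[2, 3, 4]`. The matching training targets would be the known source samples at time steps 2, 3 and 4.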

I'd say this is probably enough information to get you started.

When it comes to deciding how to provide the neural net with the data, you'll first want to normalise the input signals (consider a sigmoid function) and experiment with different transfer functions (sigmoid would probably be a good starting point).
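
One possible way to do the sigmoid normalisation is sketched here, together with its inverse. The names `squash` and `unsquash` and the `scale` parameter are assumptions for illustration; `scale` should be on the order of the typical signal amplitude so that inputs do not saturate near 0 or 1.

```python
import numpy as np

def squash(x, scale=1.0):
    """Map raw amplitudes to the open interval (0, 1) with a sigmoid."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x / scale))

def unsquash(y, scale=1.0, eps=1e-12):
    """Inverse of squash (the logit), clipped to avoid log(0)."""
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
    return scale * np.log(y / (1.0 - y))
```

A value of 0 maps to 0.5, and `unsquash(squash(x))` recovers `x` up to floating-point error as long as the sigmoid has not saturated.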

I would imagine you'll want to have 2 output neurons, each providing a normalised amplitude (which you would denormalise via the inverse of the sigmoid function), representing Source 1 and Source 2 respectively. The fitness value (the way you judge the ability of each tested network to solve the problem) would be something along the lines of the negative of the RMS error of the output signal against the actual known signal (i.e. tested against the samples I referred to earlier, which you will need to procure).
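
A minimal sketch of such a fitness function, assuming the network outputs and the known sources are stored as arrays of the same shape (for example one column per source). The name `fitness` is just for illustration and is not tied to any particular NEAT implementation.

```python
import numpy as np

def fitness(predicted, target):
    """Negative RMS error: 0 is a perfect match, more negative is worse."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    rms_error = np.sqrt(np.mean((predicted - target) ** 2))
    return -rms_error
```

Because the value is negated, an evolutionary search that maximises fitness ends up minimising the RMS error.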

Suffice it to say, this will not be a trivial operation, but it could work if you have enough samples to train the network against. What is a good number of samples? As a rule of thumb, it's roughly a number large enough that a simple polynomial function of order N (where N is the number of neurons in the neural network you require to solve the problem) cannot fit all of the samples accurately. This is basically to ensure you are not simply overfitting the problem, which is a serious issue with neural networks.

I hope this has been helpful! Best of luck.

Additional note: your work to date won't have been in vain if you go down this route. A neural network is likely to benefit from additional "help" in the form of FFTs and other signal modelling "inputs", so you might want to consider taking the signal processing you have already done, organising it into an analog, continuous representation, and feeding it as an input alongside the input signal.
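
As one hypothetical way to feed such "help" to the network, the sketch below appends the magnitude spectrum of a window to the raw window samples. The function name `augment_with_spectrum` and the choice of a Hann-windowed magnitude spectrum are assumptions; any continuous feature you have already computed could be appended the same way.

```python
import numpy as np

def augment_with_spectrum(window):
    """Concatenate raw window samples with their scaled FFT magnitudes."""
    window = np.asarray(window, dtype=float)
    tapered = window * np.hanning(len(window))
    # Divide by the window length so magnitudes stay on a scale
    # comparable to the raw samples.
    magnitude = np.abs(np.fft.rfft(tapered)) / len(window)
    return np.concatenate([window, magnitude])
```

For a window of 8 samples this yields 8 + 5 = 13 input values, so the network would need 13 input neurons.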

But how do you make the network learn that order doesn't matter? It is like a "supervised clustering" problem where you teach it how to separate a mix into two sources, but it doesn't need to know what the sources are -- it just needs to know which frequency belongs to which source. But that's a problem because neuron #1 being source 2 and neuron #2 being source 1, is just as valid a solution as neuron #1 being source 1 and neuron #2 being source 2. "Just feed it both combinations of solutions" doesn't work because if you generalize to N signals it will be factorial complexity! — pete, Aug 15 '15 at 06:41
And to expand on my previous comment, how do you generalize the neural network to be able to separate into N sources rather than just 2 sources? It would have to figure out approximately how many sources there are in the first place. How would you approach that problem? — pete, Nov 28 '15 at 19:27
