3

How to compare two spectrograms and score their similarity? How to pick the whole model/approach?

I convert phone recordings from .m4a to .wav and then plot their spectrograms in Python. The recordings all have the same length, so the data can be represented in the same dimensional space. I filtered them with a Butterworth bandpass filter (cutoff frequencies 400 Hz and 3500 Hz):

Filtered
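For reference, the bandpass step described above could look roughly like the following. This is a minimal sketch, not the code actually used for the plot: the filter order (4), the zero-phase `sosfiltfilt` call, the function name `bandpass` and the 16 kHz sample rate of the synthetic test signal are all assumptions; only the 400 Hz and 3500 Hz cutoffs come from the question.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(signal, sample_rate, low_hz=400.0, high_hz=3500.0, order=4):
    """Zero-phase Butterworth bandpass; the output has the same length as the input."""
    # Second-order sections are numerically more stable than (b, a) coefficients.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    # Filtering forward and backward cancels the phase delay of the filter.
    return sosfiltfilt(sos, signal)


# Synthetic check (assumed 16 kHz sample rate): a 100 Hz tone outside the
# passband plus a 1000 Hz tone inside it.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
low_tone = np.sin(2 * np.pi * 100 * t)
mid_tone = np.sin(2 * np.pi * 1000 * t)
filtered = bandpass(low_tone + mid_tone, sample_rate)
```

Because the filter keeps the sample count unchanged, all clips stay the same length after this step.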

To find the region of interest, I filtered by color using OpenCV (but this makes every clip a different length, which I don't want):

mask

Embedding the spectrograms as multidimensional points and scoring their accuracy as the distance to the most accurate sample would be easy to visualize in some cluster-like space, thanks to dimensionality reduction. But that seems too plain and doesn't involve training, which makes it hard to verify. How can I use a convolutional neural network, or a combination of a convolutional neural network and a time-delay neural network, to embed each spectrogram as a multidimensional point and compare the outputs of the network instead?
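To make the "distance to the most accurate sample" idea concrete, here is a small sketch under stated assumptions. The vectors are simply flattened spectrograms standing in for whatever embedding is eventually used, cosine similarity is one possible score among several, and the names `cosine_similarity` and `pca_2d` as well as the random data are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two arrays after flattening them."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    x = np.asarray(vectors, dtype=float)
    x = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T


# Hypothetical data: a reference clip, a noisy copy of it, and an unrelated clip.
rng = np.random.default_rng(0)
reference = rng.random((64, 100))
close = reference + 0.05 * rng.standard_normal((64, 100))
far = rng.random((64, 100))

score_close = cosine_similarity(reference, close)
score_far = cosine_similarity(reference, far)
points = pca_2d([reference.ravel(), close.ravel(), far.ravel()])  # for a 2-D scatter plot
```

Swapping the flattened spectrograms for network outputs changes only the inputs; the scoring and the projection stay the same.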

I switched to the Mel spectrogram:

mel

How can I use a pre-trained convolutional neural network such as VGG16 to embed spectrograms as tensors and compare them? Do I just remove the last fully connected layer and flatten the output instead?
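One way this could be done with Keras is sketched below. Note that it is a variant of what the question describes: instead of removing only the last fully connected layer and flattening, it drops the whole classifier head (`include_top=False`) and uses global average pooling, which gives a 512-dimensional vector per spectrogram. The per-clip rescaling to 0..255, the resize to 224x224 and the helper names `build_embedder` and `embed_spectrograms` are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf


def build_embedder(weights="imagenet"):
    """VGG16 convolutional base; pooling="avg" yields one 512-d vector per image."""
    # include_top=False drops the fully connected classifier layers.
    return tf.keras.applications.VGG16(include_top=False, weights=weights, pooling="avg")


def embed_spectrograms(spectrograms, model):
    """Map spectrograms of shape (batch, freq, time) to embeddings of shape (batch, 512)."""
    x = np.asarray(spectrograms, dtype="float32")
    # Rescale each spectrogram to 0..255, the range VGG preprocessing expects.
    lo = x.min(axis=(1, 2), keepdims=True)
    hi = x.max(axis=(1, 2), keepdims=True)
    x = 255.0 * (x - lo) / np.maximum(hi - lo, 1e-8)
    # VGG16 expects 3-channel images, so repeat the single channel three times.
    x = np.repeat(x[..., np.newaxis], 3, axis=-1)
    x = tf.image.resize(x, (224, 224)).numpy()
    x = tf.keras.applications.vgg16.preprocess_input(x)
    return model.predict(x, verbose=0)
```

Calling `build_embedder()` with the default argument downloads the ImageNet weights on first use; the resulting vectors can then be compared with any distance or similarity measure.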

user4157124
Josef Korbel

2 Answers

1

You can use a CNN, of course. TensorFlow, like many other frameworks, has dedicated classes for that. You simply convert your image to a tensor and apply the network; as a result you get a lower-dimensional vector that you can compare.

You can train your own CNN too.

For best accuracy it is better to stretch the lower frequencies (the bottom part of your picture) and compress the higher frequencies, since the lower frequencies are more important. You can read about the Mel scale for more information.
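As an illustration of how the Mel scale stretches low frequencies and compresses high ones, here is a short sketch. It uses the commonly quoted conversion mel = 2595 * log10(1 + f / 700); other variants of the formula exist, and the 400 Hz to 3500 Hz range (taken from the question's bandpass filter) and the band count of 8 are arbitrary choices for the example.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` frequencies in Hz that are evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]


# Band edges between the question's filter cutoffs: the gaps in Hz grow with
# frequency, so low frequencies get finer resolution than high ones.
bands = mel_spaced_frequencies(400.0, 3500.0, 8)
gaps = [b - a for a, b in zip(bands, bands[1:])]
```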

Nikolay Shmyrev
  • Hello Nikolay, thanks for your response. I will certainly filter my signal further with the Mel scale, thanks for that. About the CNN: I would love to train my own CNN for this, but I don't have thousands of training samples. I just want to compare them, and I have no idea how I would create a dataset with accuracy scores for the audio samples. About the second CNN point: you mentioned VGG16, the very deep ConvNet from K. Simonyan and A. Zisserman? You mean using that pretrained net to embed a spectrogram as a tensor? VGG is trained on images from 1000 classes... – Josef Korbel May 12 '17 at 09:17
  • Yes, you can use VGG16 – Nikolay Shmyrev May 12 '17 at 10:39
  • How can VGG16, trained on images, help me with scoring similarity between audio tensors (spectrograms)? That sounds like using a model trained on dolphin sounds to classify images. Since the process will be the same for all audio samples, the output tensors will be comparable, but wouldn't a randomly weighted CNN yield the same thing? – Josef Korbel May 15 '17 at 12:12
  • VGG still learns basic graphics primitives so it should be somewhat helpful. You can use CIFAR network too, it is better than nothing. This is called "transfer learning", something like https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html – Nikolay Shmyrev May 15 '17 at 12:18
1

In my opinion, and according to Yann LeCun, when you target speech recognition with a deep neural network you have two requirements:

  • You need a recurrent neural network in order to have memory (memory is really important for speech recognition...)

and

  • You need a lot of training data

You may try an RNN in TensorFlow, but you will definitely need a lot of training data.

If you don't want to (or can't) find or generate a lot of training data, you have to forget about deep learning for this problem...

In that case, you may take a look at how Shazam works (it is based on a fingerprinting algorithm).
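To give an idea of what a fingerprint-based comparison might look like, here is a rough sketch in the spirit of landmark fingerprinting. It is not Shazam's actual implementation: the peak-picking rule, the parameters (`neighborhood`, `percentile`, `fan_out`, `max_dt`), the hash layout and the Jaccard score are all assumptions chosen to keep the example short.

```python
import numpy as np
from scipy.ndimage import maximum_filter


def find_peaks(spectrogram, neighborhood=15, percentile=95):
    """Return (freq_bin, time_bin) pairs of strong local maxima, sorted by time."""
    s = np.asarray(spectrogram, dtype=float)
    # A point is a peak if it equals the maximum of its neighborhood...
    is_local_max = maximum_filter(s, size=neighborhood) == s
    # ...and is among the loudest points of the whole spectrogram.
    is_strong = s >= np.percentile(s, percentile)
    freqs, times = np.nonzero(is_local_max & is_strong)
    order = np.argsort(times, kind="stable")
    return list(zip(freqs[order].tolist(), times[order].tolist()))


def fingerprint(spectrogram, fan_out=5, max_dt=50):
    """Pair each peak with a few later peaks and hash them as (f1, f2, time_delta)."""
    peaks = find_peaks(spectrogram)
    hashes = set()
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.add((f1, f2, dt))
    return hashes


def jaccard(a, b):
    """Share of hashes common to both sets (1.0 means identical fingerprints)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two recordings can then be scored with `jaccard(fingerprint(spec_a), fingerprint(spec_b))`, where `spec_a` and `spec_b` are their spectrogram arrays. No training data is needed, which is the point of this approach.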

A. STEFANI
  • It looks like the job can also be done with a CNN; see [http://stackoverflow.com/questions/22471072/convolutional-neural-network-cnn-for-audio]. Hrant Khachatrian said: "We had around 95% accuracy ..." and gives a link to a dataset provided in this TopCoder contest – A. STEFANI May 22 '17 at 10:24