34

I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorials are well explained and easy to understand and follow.

I want to extend the same CNN to extract multi-modal features (images + audio) from videos at the same time.

I understand that video input is nothing but a sequence of images (pixel intensities) displayed over a period of time (e.g. 30 FPS) and associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be fed into the network.
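So far, the only thing I can picture is that raw audio is a one-dimensional signal. Here is a minimal sketch of how I currently load it in Python (the filename "example.wav" is a placeholder):

    # Raw audio is just a 1-D array of amplitude samples taken `rate` times
    # per second, the same way 30 FPS video is 30 frames per second.
    from scipy.io import wavfile

    rate, samples = wavfile.read("example.wav")  # e.g. rate = 44100 Hz
    print(rate)           # samples per second
    print(samples.shape)  # (n_samples,) for mono audio
    print(samples[:10])   # the first ten amplitude values

But I don't know what to do with this array from here.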

I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none of them explain how audio is fed into the network.

Moreover, I understand from my studies that multi-modal representation is the way our brains really work, as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously, without our awareness, through joint representation. A simple example: if we hear a lion roar, we instantly compose a mental image of a lion and feel danger, and vice versa. Multiple neural patterns fire in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, and so on.

The above is my ultimate goal, but for the time being I'm breaking my problem down for the sake of simplicity.

I would really appreciate it if anyone could shed light on how audio is dissected and then represented in a convolutional neural network. I would also appreciate your thoughts on multi-modal synchronisation, joint representations, and the proper way to train a CNN with multi-modal data.

moeabdol

2 Answers

20

We used deep convolutional networks on spectrograms for a spoken language identification task and achieved around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.

Plain convolutional networks do not capture temporal characteristics, so, for example, in this work the output of the convolutional network was fed to a time-delay neural network. Our experiments show, however, that even without such additional elements convolutional networks can perform well, at least on some tasks, as long as the inputs have similar sizes.
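As a minimal sketch of the spectrogram input described above (an illustration using standard SciPy, not the exact pipeline from the contest entry; the filename is a placeholder):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("utterance.wav")   # mono WAV, placeholder name
    samples = samples.astype(np.float32)

    # Short-time Fourier magnitudes: one row per frequency bin,
    # one column per time frame.
    freqs, times, Sxx = spectrogram(samples, fs=rate, nperseg=512, noverlap=256)

    # Log-compress and normalise so the dynamic range suits a conv net.
    log_spec = np.log1p(Sxx)
    log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)

    # The result is a 2-D array that a CNN can consume exactly like a
    # single-channel (grayscale) image.
    print(log_spec.shape)  # (n_frequency_bins, n_time_frames)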

Hrant Khachatrian
  • The "in this work" Microsoft link doesn't lead to any article or PDF; can you mention the title? – AlexGuevara Apr 07 '17 at 12:04
  • Sorry for the late reply. Here it is: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=A979AbYAAAAJ&citation_for_view=A979AbYAAAAJ:ufrVoPGSRksC – Hrant Khachatrian Apr 18 '17 at 12:03
9

There are many techniques for extracting feature vectors from audio data in order to train classifiers. The most commonly used is MFCC (Mel-frequency cepstral coefficients), which you can think of as an "improved" spectrogram that retains more of the information relevant for discriminating between classes. Another commonly used technique is PLP (Perceptual Linear Prediction), which also gives good results. There are still many other, lesser-known ones.
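As a hedged sketch, MFCC extraction with the librosa library looks like this (one option among many toolkits; the filename is a placeholder):

    import librosa

    # Load the audio; sr=None keeps the file's native sample rate.
    y, sr = librosa.load("speech.wav", sr=None)

    # 13 Mel-frequency cepstral coefficients per analysis frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    print(mfcc.shape)  # (13, n_frames): one 13-dim feature vector per frame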

More recently, deep networks have been used to extract feature vectors by themselves, much like the way we now do in image recognition. This is an active area of research. Not long ago we also used hand-crafted feature extractors (SIFT, HOG, etc.) to train classifiers for images, but these were replaced by deep learning techniques that take raw images as input and learn feature vectors by themselves (indeed, that is what deep learning is really all about).

It's also very important to notice that audio data is sequential. After training a frame-level classifier, you need to train a sequential model, such as an HMM or CRF, that chooses the most likely sequence of speech units, using the probabilities given by your classifier as input.
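To make that decoding step concrete, here is a toy sketch: given per-frame class probabilities from a classifier and a hand-set (hypothetical) transition matrix, a Viterbi pass picks the most likely label sequence, which is the core of what an HMM decoder does:

    import numpy as np

    def viterbi(frame_probs, transitions):
        """frame_probs: (T, K) classifier probabilities per frame.
        transitions: (K, K) matrix, transitions[i, j] = P(state j | state i).
        Returns the most likely state sequence of length T."""
        T, K = frame_probs.shape
        log_p = np.log(frame_probs + 1e-12)
        log_a = np.log(transitions + 1e-12)
        score = np.zeros((T, K))        # best log-score ending in each state
        back = np.zeros((T, K), int)    # best predecessor, for backtracking
        score[0] = log_p[0]
        for t in range(1, T):
            cand = score[t - 1][:, None] + log_a   # (K, K) candidate scores
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0) + log_p[t]
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]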

A good starting point for learning speech recognition is Jurafsky and Martin's Speech and Language Processing. It explains all of these concepts very well.

[EDIT: adding some potentially useful information]

There are many speech recognition toolkits with modules to extract MFCC feature vectors from audio files, but using them for this purpose is not always straightforward. I'm currently using CMU Sphinx4. It has a class named FeatureFileDumper that can be used standalone to generate MFCC vectors from audio files.

Saul Berardo
  • Spectrograms contain all the information that waves (the most direct representation of sound) have. – Laie Jan 08 '15 at 12:20
  • Laie is correct. I am currently using the spectrogram approach, and the first function I wrote converts a WAV file to a spectrogram and then back to WAV. It reproduces the audio with 100% accuracy except for the first few and last few samples. – ghostbust555 Nov 27 '16 at 22:53