
I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.

I read the file into an audio buffer using a method similar to that described in this post.

Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:

double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));

Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is a 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using Root Mean Square to evaluate loudness. Is this required, or just a more accurate way of doing it?

The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!

Steph

2 Answers


The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute-value function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting, that must have to do with the way you are extracting the individual values in line 20.
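As a rough sketch of that averaging, assuming the buffer holds 16-bit signed little-endian mono PCM with the WAV header already stripped (the class and method names here are made up for illustration, not taken from the question):

```java
// Sketch only: assumes 16-bit signed little-endian mono PCM, header already removed.
final class MeanAbsAmplitude {

    /** Decodes the 16-bit little-endian sample that starts at the given byte offset. */
    static int sampleAt(byte[] audioData, int byteOffset) {
        // The low byte is masked so it is not sign-extended; the high byte keeps its sign.
        return (audioData[byteOffset] & 0xff) | (audioData[byteOffset + 1] << 8);
    }

    /** Mean of |sample| over the sample range [firstSample, lastSample). */
    static double meanAbs(byte[] audioData, int firstSample, int lastSample) {
        long sum = 0;
        for (int s = firstSample; s < lastSample; s++) {
            // Each sample occupies two bytes, so sample s starts at byte 2 * s.
            sum += Math.abs(sampleAt(audioData, 2 * s));
        }
        int count = lastSample - firstSample;
        return count > 0 ? (double) sum / count : 0.0;
    }

    public static void main(String[] args) {
        // Four samples: 1000, -1000, 2000, -2000, written as little-endian byte pairs.
        byte[] data = { (byte) 0xE8, 0x03, 0x18, (byte) 0xFC, (byte) 0xD0, 0x07, 0x30, (byte) 0xF8 };
        System.out.println(meanAbs(data, 0, 4));
    }
}
```

Without the absolute value, the four example samples would average to zero even though the signal is clearly not silent.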

That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse - for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since e.g. frequencies above 22 kHz (or was it 44 kHz? not sure which) are in fact not representable in simple WAV format.

Kilian Foth
  • Okay, great. I was just concerned that I wasn't extracting the amplitude properly. But it sounds like I am. Out of curiosity, if I did care about the non-linear relationship between amplitude and acoustic power, would that be when I apply a FFT? – Steph Dec 05 '11 at 09:07
  • A flat-line value at the peak of the amplitudes represented by that format will sound exactly like a flat-line value of 0. Completely silent. Averaging the values is not the way to go. Either use RMS (my preferred choice), or calculate a dB level, for a more accurate value of 'volume'. – Andrew Thompson Dec 05 '11 at 09:18
  • @AndrewThompson - All right, so I'm starting to be convinced that RMS is a good idea. If I also want to take into account the non-linearity in the way the ear responds to frequencies and amplitudes (i.e. if I want to use an FFT), how do I do that in combination with RMS? Or would I have to do that instead of RMS? – Steph Dec 06 '11 at 08:42
  • That is beyond my level of experience. I've done some work on RMS (which is pretty simple to calculate, once you have int or float values) but nothing beyond that. – Andrew Thompson Dec 06 '11 at 10:21

I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.

RMS will be much better than Peak amplitude. Using peak amplitudes is like determining the brightness of an image based on the brightest pixel, rather than averaging.
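To make the RMS versus peak comparison concrete, here is a minimal sketch. It assumes 16-bit signed little-endian mono PCM bytes, and the class name, method names, and the full-scale constant of 32768 are my own choices for illustration:

```java
// Sketch only: assumes 16-bit signed little-endian mono PCM, header already removed.
final class RmsLoudness {
    /** Magnitude of the most negative 16-bit sample, used as the 0 dBFS reference (an assumption). */
    static final double FULL_SCALE = 32768.0;

    static int sampleAt(byte[] pcm, int byteOffset) {
        return (pcm[byteOffset] & 0xff) | (pcm[byteOffset + 1] << 8);
    }

    /** Root mean square of the samples in [firstSample, lastSample). */
    static double rms(byte[] pcm, int firstSample, int lastSample) {
        double sumSquares = 0.0;
        for (int s = firstSample; s < lastSample; s++) {
            double v = sampleAt(pcm, 2 * s);
            sumSquares += v * v;
        }
        int count = lastSample - firstSample;
        return count > 0 ? Math.sqrt(sumSquares / count) : 0.0;
    }

    /** Largest absolute sample value in [firstSample, lastSample). */
    static int peak(byte[] pcm, int firstSample, int lastSample) {
        int max = 0;
        for (int s = firstSample; s < lastSample; s++) {
            max = Math.max(max, Math.abs(sampleAt(pcm, 2 * s)));
        }
        return max;
    }

    /** Converts a linear amplitude to decibels relative to full scale; 0 maps to -Infinity. */
    static double toDbfs(double amplitude) {
        return 20.0 * Math.log10(amplitude / FULL_SCALE);
    }

    public static void main(String[] args) {
        // Three silent samples followed by a single click at half of full scale (16384).
        byte[] pcm = { 0, 0, 0, 0, 0, 0, 0x00, 0x40 };
        System.out.println("peak = " + peak(pcm, 0, 4) + ", rms = " + rms(pcm, 0, 4));
    }
}
```

A single click dominates the peak value no matter how long the surrounding silence is, while the RMS shrinks as the quiet stretch grows, which is closer to what "how loud is this shot" is asking.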

If you want to filter the signal or weight it for perceived loudness, then you would need the sample rate for that.
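Even without filtering, the sample rate is what converts a shot's start and end times into positions in the audio data. A small sketch, assuming the 22,050 Hz 16-bit mono format from the question and shot boundaries given in seconds (the class, constants, and method names are hypothetical):

```java
// Sketch only: constants match the format described in the question (22,050 Hz, 16-bit, mono).
final class ShotWindow {
    static final int SAMPLE_RATE = 22_050;   // samples per second
    static final int BYTES_PER_SAMPLE = 2;   // 16-bit mono

    /** Index of the sample closest to the given time in seconds. */
    static int sampleIndex(double seconds) {
        return (int) Math.round(seconds * SAMPLE_RATE);
    }

    /** Byte offset into the raw PCM data (header excluded) for the given time. */
    static int byteOffset(double seconds) {
        return sampleIndex(seconds) * BYTES_PER_SAMPLE;
    }

    public static void main(String[] args) {
        // A hypothetical shot that runs from 1.0 s to 2.5 s.
        System.out.println(sampleIndex(1.0) + " .. " + sampleIndex(2.5));
    }
}
```

The two sample indices can then bound whatever loudness measure is computed for that shot.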

FFT should not be required unless you want to do complex frequency analysis as well. The ear does not respond linearly: its sensitivity varies with both frequency and amplitude. If you want to account for that, you could use an FFT to analyze the frequency content as a further refinement.
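To give a feel for what frequency analysis measures, the sketch below estimates how much of one chosen frequency is present in a block of samples. Note that it is a direct single-frequency DFT, not an FFT, and the class and method names are invented for this example; a real FFT computes all frequency bins at once and far more efficiently.

```java
// Sketch only: a direct DFT evaluated at one frequency, used here in place of a full FFT.
final class SingleBinDft {

    /**
     * Estimates the amplitude of the sinusoidal component at frequencyHz.
     * The estimate is exact only when the block spans a whole number of cycles.
     */
    static double amplitudeAt(double[] samples, double frequencyHz, double sampleRateHz) {
        double re = 0.0;
        double im = 0.0;
        for (int n = 0; n < samples.length; n++) {
            double phase = 2.0 * Math.PI * frequencyHz * n / sampleRateHz;
            re += samples[n] * Math.cos(phase);
            im -= samples[n] * Math.sin(phase);
        }
        // Scale by 2/N so that a pure sine of amplitude A reports A.
        return 2.0 * Math.hypot(re, im) / samples.length;
    }

    public static void main(String[] args) {
        double sampleRate = 22_050.0;
        // 100 ms of a 1 kHz tone with amplitude 1000 (exactly 100 cycles in 2205 samples).
        double[] tone = new double[2205];
        for (int n = 0; n < tone.length; n++) {
            tone[n] = 1000.0 * Math.sin(2.0 * Math.PI * 1000.0 * n / sampleRate);
        }
        System.out.println("at 1 kHz: " + amplitudeAt(tone, 1000.0, sampleRate));
        System.out.println("at 3 kHz: " + amplitudeAt(tone, 3000.0, sampleRate));
    }
}
```

Per-frequency amplitudes like these are what a perceptual weighting would scale before combining them into one loudness figure; which weighting curve to apply is a separate choice not covered here.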

justin