
I have the following code that stores raw audio data from a WAV file in a byte buffer:

BYTE header[74];
fread(header, sizeof(BYTE), 74, inputFile);  // skip past the header to the data chunk size
BYTE * sound_buffer;
DWORD data_size;

fread(&data_size, sizeof(DWORD), 1, inputFile);
sound_buffer = (BYTE *)malloc(data_size);  // sizeof(BYTE) is 1 by definition
fread(sound_buffer, sizeof(BYTE), data_size, inputFile);

Is there any algorithm to determine when the audio track is silent (literally no sound) and when there is some sound level?

SteveS

2 Answers


Well, your "sound" will be an array of values, whether integer or real, depending on your format.

For the file to be silent, or "have no sound", the values in that array will have to be zero, or very close to zero, or, worst case, if the audio has a DC bias, the value will stay roughly constant instead of fluctuating around to produce sound waves.

You can write a simple function that returns the delta for a range, in other words the difference between the largest and smallest value: the lower the delta, the lower the sound volume.

Alternatively, you can write a function that returns the ranges in which the delta stays below a given threshold.

For the sake of toying, I wrote a nifty class:

#include <limits>
#include <utility>
#include <vector>

using uint = unsigned int;  // not a standard type; define it explicitly

template<typename T>
class SilenceFinder {
public:
  SilenceFinder(T * data, uint size, uint samples)
    : d(data), sBegin(0), s(size), samp(samples), status(Undefined) {}

  // Returns (start, end) pairs of silent regions, converted to seconds.
  std::vector<std::pair<uint, uint>> find(const T threshold, const uint window) {
    auto r = findSilence(d, s, threshold, window);
    regionsToTime(r);
    return r;
  }

private:
  enum Status {
    Silent, Loud, Undefined
  };

  void toggleSilence(Status st, uint pos, std::vector<std::pair<uint, uint>> & res) {
    if (st == Silent) {
      if (status != Silent) sBegin = pos;  // a silent region starts here
      status = Silent;
    } else {
      if (status == Silent) res.push_back({sBegin, pos});  // region ends
      status = Loud;
    }
  }

  void end(Status st, uint pos, std::vector<std::pair<uint, uint>> & res) {
    if ((status == Silent) && (st == Silent)) res.push_back({sBegin, pos});
  }

  // Difference between the largest and smallest sample in the window.
  static T delta(T * data, const uint window) {
    // lowest() rather than min(): for floating-point T, min() is the
    // smallest positive value, not the most negative one.
    T min = std::numeric_limits<T>::max();
    T max = std::numeric_limits<T>::lowest();
    for (uint i = 0; i < window; ++i) {
      T c = data[i];
      if (c < min) min = c;
      if (c > max) max = c;
    }
    return max - min;
  }

  std::vector<std::pair<uint, uint>> findSilence(T * data, const uint size, const T threshold, const uint window) {
    std::vector<std::pair<uint, uint>> regions;
    uint pos = 0;
    Status st = Undefined;
    while ((pos + window) <= size) {
      st = (delta(data + pos, window) < threshold) ? Silent : Loud;
      toggleSilence(st, pos, regions);
      pos += window;
    }
    if (pos < size)  // handle the partial window at the end, if any
      st = (delta(data + pos, size - pos) < threshold) ? Silent : Loud;
    end(st, size, regions);
    return regions;
  }

  // Convert sample indices to seconds using the sample rate.
  void regionsToTime(std::vector<std::pair<uint, uint>> & regions) {
    for (auto & r : regions) {
      r.first /= samp;
      r.second /= samp;
    }
  }

  T * d;        // sample data
  uint sBegin;  // start index of the current silent region
  uint s;       // total number of samples
  uint samp;    // sample rate
  Status status;
};

I haven't really tested it, but it looks like it should work. Note, however, that it assumes a single audio channel; you will have to extend it to work with and across multichannel audio. Here is how you use it:

SilenceFinder<audioDataType> finder(audioDataPtr, sizeOfData, sampleRate);
auto res = finder.find(threshold, scanWindow);
// and output the silent regions
for (auto r : res) std::cout << r.first << " " << r.second << std::endl;

Also notice that, the way it is implemented right now, the "cut" to silent regions will be very abrupt. Such "noise gate" type filters usually come with attack and release parameters, which smooth out the result. For example, there might be five seconds of silence with just a tiny pop in the middle: without attack and release parameters you will get the five seconds split in two, and the pop will actually remain, but using those parameters you can implement varying sensitivity to when to cut it off.
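As a crude stand-in for release/hold behavior, one can post-process the region list: drop silent regions shorter than a minimum length and merge regions separated by a gap shorter than that length. This is only a sketch; `mergeShortRegions` and its `minLen` parameter are illustrative names, not part of the class above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical post-processing helper: ignore silent regions shorter than
// minLen windows, and merge regions whose separating "loud" gap (e.g. a
// tiny pop) is shorter than minLen, so a long silence is not split in two.
std::vector<std::pair<std::size_t, std::size_t>>
mergeShortRegions(std::vector<std::pair<std::size_t, std::size_t>> regions,
                  std::size_t minLen) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (const auto& r : regions) {
        if (r.second - r.first < minLen) continue;      // too short: ignore
        if (!out.empty() && r.first - out.back().second < minLen)
            out.back().second = r.second;               // gap too short: merge
        else
            out.push_back(r);
    }
    return out;
}
```

With `minLen = 3`, the regions `{0,10}` and `{12,20}` (gap of 2) merge into `{0,20}`, and a stray `{30,31}` blip is dropped.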

dtech
  • So I guess I need to find a silent part and look at how the PCM data looks, right? How do I know what array index corresponds to what time in the track? – SteveS Mar 17 '15 at 01:21
  • You can calculate that if you know the sample rate. For example, at 48 kHz you will have 48000 samples for each second of audio. – dtech Mar 17 '15 at 01:23
  • I see, and does the number of samples get affected by stereo signal? – SteveS Mar 17 '15 at 01:29
  • No, stereo or surround - it only increases channel count, the sample rate is the same. – dtech Mar 17 '15 at 01:31
  • Ok, my video is 1:20:00 long. I multiply 4800 (the duration of my video in seconds) by 48000 (which is my sample rate) and multiply it by 2 (because my audio is 16 bits per sample, so 2 bytes) and I should get the length of my array? – SteveS Mar 17 '15 at 01:42
  • Well, mostly yes; since multiple channels are usually interleaved, it should all be crammed into a single "stream", so to speak. – dtech Mar 17 '15 at 01:44
  • The number of channels seems to have an impact on the array. In fact I have twice the amount of bytes I counted above. So how does it work? Do the audio bytes pair up in the stream like this: Ch1;Ch2;Ch1;Ch2...? – SteveS Mar 17 '15 at 01:52
  • I have only worked with raw mono audio, so I can't tell you from experience, but generally that's the idea of interleaving, although I am not sure whether the step is a single byte per channel. – dtech Mar 17 '15 at 01:56
  • Found a link to this: [link](http://stackoverflow.com/questions/13995936/what-is-a-channel-in-a-wav-file-formatdo-all-channels-play-simultaneaously-whe) – SteveS Mar 17 '15 at 01:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/73126/discussion-between-steves-and-ddriver). – SteveS Mar 17 '15 at 02:10
  • That's a lot of code, could you give me please more in-depth explanation? – SteveS Mar 17 '15 at 21:48
  • @SteveS - it is rather simple: the "window" is a region you scan and move along the file, and the `delta()` function checks whether the window is silent or not. Based on that, and on whether the previous window was silent or not, a silent region either starts or ends and is added to the vector of silent regions. If you have a silent window and the previous was loud, silence begins; if you have a loud window and the previous was silent, silence ends and is registered with the marked start and current position. That's it. – dtech Mar 17 '15 at 21:55
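The frame layout discussed in the comments above can be sketched as follows, assuming 16-bit stereo PCM; `deinterleave` and `frameToSeconds` are illustrative names, not from the answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 16-bit stereo PCM interleaves samples per frame: L0 R0 L1 R1 ...
// Total bytes = seconds * sampleRate * channels * bytesPerSample,
// which is why the stereo buffer is twice the mono estimate.

// Split an interleaved stereo buffer into two mono channels.
void deinterleave(const int16_t* data, std::size_t frames,
                  std::vector<int16_t>& left, std::vector<int16_t>& right) {
    left.resize(frames);
    right.resize(frames);
    for (std::size_t i = 0; i < frames; ++i) {
        left[i]  = data[2 * i];      // channel 1
        right[i] = data[2 * i + 1];  // channel 2
    }
}

// Convert a frame index to a time in seconds.
double frameToSeconds(std::size_t frame, unsigned sampleRate) {
    return static_cast<double>(frame) / sampleRate;
}
```

At 48 kHz, frame index 48000 corresponds to the 1-second mark, regardless of channel count.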

To check whether the portion of the track between t1 and t2 is 'silent', compute the root mean square (RMS) of the samples between t1 and t2, then check whether the RMS is at or below some threshold value that you decide constitutes 'silence'. See http://en.wikipedia.org/wiki/Root_mean_square
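A minimal sketch of that check, assuming the samples have already been converted to `double` in the range [-1, 1]; the names `rms` and `isSilent` are illustrative.

```cpp
#include <cmath>
#include <cstddef>

// Root mean square of a sample window: sqrt(mean of squared samples).
double rms(const double* samples, std::size_t n) {
    double sumSq = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sumSq += samples[i] * samples[i];
    return std::sqrt(sumSq / n);
}

// The window counts as silent if its RMS is at or below the threshold.
bool isSilent(const double* samples, std::size_t n, double threshold) {
    return rms(samples, n) <= threshold;
}
```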

mti2935
  • RMS finds AC power, but does not work on DC. His raw data may have DC bias, and moreover there's no need to find the specific power level. – Potatoswatter Mar 17 '15 at 01:12
  • Good point about the DC bias. To account for this, it would be a good idea to first apply a high-pass filter to the data with a low cutoff frequency (say 10 Hz or so). After that, the RMS will be proportional to the power level, which in turn is proportional to the volume of the sound coming from the speaker. – mti2935 Mar 17 '15 at 09:54
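As a sketch of that suggestion, here is a simple one-pole high-pass filter to strip DC bias before computing the RMS. The filter form and the coefficient `a` are my own assumptions, not from the comment; a value of `a` near 1.0 corresponds to a low cutoff frequency.

```cpp
#include <cstddef>
#include <vector>

// One-pole high-pass filter: y[i] = a * (y[i-1] + x[i] - x[i-1]).
// A constant (DC) input decays toward zero, so any DC bias is removed
// before an RMS-based silence check is applied.
std::vector<double> highPass(const std::vector<double>& x, double a) {
    std::vector<double> y(x.size());
    if (x.empty()) return y;
    y[0] = x[0];
    for (std::size_t i = 1; i < x.size(); ++i)
        y[i] = a * (y[i - 1] + x[i] - x[i - 1]);
    return y;
}
```

Feeding a pure DC signal through the filter shows the bias dying out: the output decays geometrically toward zero while an actual audio waveform passes through largely unchanged.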