
I have many audio files to which I would like to automatically add timestamps marking when speech begins and ends: a "start" timestamp when an utterance begins and a "stop" timestamp when the utterance ends.

Like:

start,stop
0:00:02.40,0:00:11.18
0:00:18.68,0:00:19.77
...

I tested the following solution and it works OK: Split audio files using silence detection. The issue is that I only get the chunks from this, which makes matching the timestamps against the original audio somewhat difficult.

Any solutions or nudges in the right direction would be highly appreciated!

NicolaiF
  • Is it OK if the algorithm only picks up on sound power beyond a certain threshold? Or should it ignore noise and only pick up on actual speech? The latter is much more complex. – Lukasz Tracewski Nov 30 '19 at 09:30
  • just sound power beyond a certain threshold - as in the solution provided in the link. – NicolaiF Dec 01 '19 at 21:07
  • Hi @NicolaiF, please post your code (even if it comes from another question, if the license allows it). Please also explain what type of result you expect. Is it a text file? In your question it seems that you want to *add* timestamps to the original audio file. – Tim Dec 02 '19 at 16:39

2 Answers


Ideally, applying ML algorithms with comprehensive train/test data would yield a dynamic solution that may not need any manual tuning of silence length and threshold values.

However, a simple static solution can be devised using pydub's detect_nonsilent method. This method returns the start and stop times of non-silent chunks in a continuous manner.

The following parameters affect the result and may need some tuning.

min_silence_len : the minimum silence length, in ms, that you expect in the audio.
silence_thresh : anything quieter than this threshold is considered silence.

While trying this out, I noticed that it helps a lot to normalize the audio before running it through detect_nonsilent, probably because gain is applied to reach an average amplitude level, which makes detecting silence much easier.

The sample audio file was downloaded from the Open Speech Repository. Each audio file has 10 spoken sentences with some gap in between.

Here is a working demo:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

#adjust target amplitude
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

#Convert wav to audio_segment
audio_segment = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

#normalize audio_segment to -20dBFS 
normalized_sound = match_target_amplitude(audio_segment, -20.0)
print("length of audio_segment={} seconds".format(len(normalized_sound)/1000))

#Detect non-silent chunks, which in our case are the spoken words.
nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=500, silence_thresh=-20, seek_step=1)

#convert ms to seconds
print("start,Stop")
for chunks in nonsilent_data:
    print( [chunk/1000 for chunk in chunks])

Result:

root# python nonSilence.py 
length of audio_segment=33.623 seconds
start,Stop
[0.81, 2.429]
[4.456, 5.137]
[8.084, 8.668]
[11.035, 12.334]
[14.387, 15.601]
[17.594, 18.133]
[20.733, 21.289]
[24.007, 24.066]
[27.372, 27.977]
[30.361, 30.996]

As seen in Audacity (comparison shown below), our results are close, within roughly a 0.1 - 0.4 second offset. Tuning the detect_nonsilent arguments may help.

Count From Script   From Audacity
1   0.81-2.429      0.573-2.833
2   4.456-5.137     4.283-6.421
3   8.084-8.668     7.824-9.679
4   11.035-12.334   10.994-12.833
5   14.387-15.601   14.367-16.120
6   17.594-18.133   17.3-19.021
7   20.733-21.289   20.471-22.258
8   24.007-24.066   23.843-25.664
9   27.372-27.977   27.081-28.598
10  30.361-30.996   30.015-32.240
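If you need the ranges in the start,stop timestamp format from the question, a small helper can convert the millisecond values. This is only a sketch layered on top of the demo above; the format_ts name is made up here, and it assumes the nonsilent_data list from the demo is still in scope:

def format_ts(ms):
    #hypothetical helper: format milliseconds as H:MM:SS.xx (e.g. 0:00:02.40)
    total_seconds = ms / 1000.0
    hours = int(total_seconds // 3600)
    minutes = int((total_seconds % 3600) // 60)
    seconds = total_seconds % 60
    return "{}:{:02d}:{:05.2f}".format(hours, minutes, seconds)

print("start,stop")
for start_ms, stop_ms in nonsilent_data:
    print("{},{}".format(format_ts(start_ms), format_ts(stop_ms)))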

(Audacity screenshot comparing the detected ranges against the waveform)

Anil_M
  • Works well! You mention ML algorithms; I have been looking into some, mainly diarization methods such as the ones listed here: https://github.com/wq2012/awesome-diarization. Do you have any suggestions for algorithms or approaches? – NicolaiF Dec 03 '19 at 10:28
  • Glad it worked well. You may want to look into pyAudioAnalysis, which has built-in ML methods with diarization support. It's a pain to get it working, but once you do it works like magic. https://github.com/tyiannak/pyAudioAnalysis/wiki – Anil_M Dec 03 '19 at 16:22

You can do something similar to the pydub solution you posted above, but instead use the detect_silence function (from pydub.silence import detect_silence), which gives you "silent ranges": the start and stop of each silent period. The negative image of that (the starts as stops and the stops as starts) gives you the periods of non-silence; a sketch of this inversion follows the quoted code below. Someone shows an example of using detect_silence here.

EDIT:
Here is the example code from the link (just in case that link goes down):

def test_realistic_audio(self):
    silent_ranges = detect_silence(self.seg4, min_silence_len=1000, silence_thresh=self.seg4.dBFS)

    prev_end = -1
    for start, end in silent_ranges:
        self.assertTrue(start > prev_end)
        prev_end = end
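
For completeness, here is a standalone sketch of the "negative image" idea described above. It is only an illustration, not the linked code: the file name and parameter values are placeholders you would replace with your own, and the threshold is set relative to the clip's dBFS.

from pydub import AudioSegment
from pydub.silence import detect_silence

#placeholder file name; use your own audio here
sound = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

#silent spans as [start_ms, end_ms] pairs
silent_ranges = detect_silence(sound, min_silence_len=500, silence_thresh=sound.dBFS - 16)

#invert the silent spans: the gaps between them are the speech spans
speech_ranges = []
prev_end = 0
for start, end in silent_ranges:
    if start > prev_end:
        speech_ranges.append([prev_end, start])
    prev_end = end
if prev_end < len(sound):
    speech_ranges.append([prev_end, len(sound)])

print("start,stop (ms)")
for start, end in speech_ranges:
    print(start, end)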
CognizantApe