
I have many audio files to which I would like to automatically add timestamps marking when speech begins and ends: a "start" timestamp when an utterance begins and a "stop" timestamp when the utterance ends.

Like:

start,stop
0:00:02.40,0:00:11.18
0:00:18.68,0:00:19.77
...

I tested the following solution and it works OK: Split audio files using silence detection. The issue is that I only get the chunks from this, which makes matching the timestamps against the original audio somewhat difficult.

Any solutions or nudges in the right direction would be highly appreciated!

NicolaiF
  • Is it OK if the algorithm only picks up on sound power beyond a certain threshold? Or should it ignore noise and only pick up on actual speech? The latter is much more complex. – Lukasz Tracewski Nov 30 '19 at 09:30
  • just sound power beyond a certain threshold - as in the solution provided in the link. – NicolaiF Dec 01 '19 at 21:07
  • Hi @NicolaiF, please post your code (even if it comes from another question, if the license allows it). Please also explain what type of result you expect. Is it a text file? In your question it seems that you want to *add* timestamps to the original audio file. – Tim Dec 02 '19 at 16:39

2 Answers


Ideally, applying ML algorithms with comprehensive train/test data would yield a dynamic solution that may not need any manual tuning of silence length and threshold values.

However, a simple static solution can be devised using pydub's detect_nonsilent method. This method returns the start and stop times of non-silent chunks in a continuous manner.

The following parameters affect the result and may need some tuning.

min_silence_len : the minimum silence length, in ms, that you expect in the audio.
silence_thresh : anything quieter than this threshold is considered silence.

While trying this out, I noticed that it helps a lot to normalize the audio before running it through detect_nonsilent, probably because gain is applied to reach an average amplitude level, which makes detecting silence much easier.

The sample audio file was downloaded from the Open Speech Repository. Each audio file has 10 spoken sentences with some gap in between.

Here is a working demo:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

#adjust target amplitude
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

#Convert wav to audio_segment
audio_segment = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

#normalize audio_segment to -20dBFS 
normalized_sound = match_target_amplitude(audio_segment, -20.0)
print("length of audio_segment={} seconds".format(len(normalized_sound)/1000))

#Detect non-silent chunks, which in our case are the spoken words.
nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=500, silence_thresh=-20, seek_step=1)

#convert ms to seconds
print("start,Stop")
for chunks in nonsilent_data:
    print( [chunk/1000 for chunk in chunks])

Result:

root# python nonSilence.py 
length of audio_segment=33.623 seconds
start,Stop
[0.81, 2.429]
[4.456, 5.137]
[8.084, 8.668]
[11.035, 12.334]
[14.387, 15.601]
[17.594, 18.133]
[20.733, 21.289]
[24.007, 24.066]
[27.372, 27.977]
[30.361, 30.996]

As seen in Audacity (comparison shown below), our results are close, within roughly a 0.1 - 0.4 second offset. Tuning the detect_nonsilent arguments may help.

Count From Script   From Audacity
1   0.81-2.429      0.573-2.833
2   4.456-5.137     4.283-6.421
3   8.084-8.668     7.824-9.679
4   11.035-12.334   10.994-12.833
5   14.387-15.601   14.367-16.120
6   17.594-18.133   17.3-19.021
7   20.733-21.289   20.471-22.258
8   24.007-24.066   23.843-25.664
9   27.372-27.977   27.081-28.598
10  30.361-30.996   30.015-32.240
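If you need the ranges in the start,stop timestamp format from the question, a small helper can convert the millisecond values. This is only a sketch layered on top of the demo above; the format_ts name is made up here, and it assumes the nonsilent_data list from the demo is still in scope:

def format_ts(ms):
    #hypothetical helper: format milliseconds as H:MM:SS.xx (e.g. 0:00:02.40)
    total_seconds = ms / 1000.0
    hours = int(total_seconds // 3600)
    minutes = int((total_seconds % 3600) // 60)
    seconds = total_seconds % 60
    return "{}:{:02d}:{:05.2f}".format(hours, minutes, seconds)

print("start,stop")
for start_ms, stop_ms in nonsilent_data:
    print("{},{}".format(format_ts(start_ms), format_ts(stop_ms)))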

(Audacity screenshot comparing the detected ranges against the waveform)

Anil_M
  • Works well! You mention ML algorithms; I have been looking into some, mainly diarization methods such as the ones listed here: https://github.com/wq2012/awesome-diarization. Do you have any suggestions for algorithms or approaches? – NicolaiF Dec 03 '19 at 10:28
  • Glad it worked well. You may want to look into pyAudioAnalysis, which has built-in ML methods with diarization support. It's a pain to get it working, but once you do it works like magic. https://github.com/tyiannak/pyAudioAnalysis/wiki – Anil_M Dec 03 '19 at 16:22

You can do something similar to the pydub solution you posted above, but instead use the detect_silence function (from pydub.silence import detect_silence), which gives you "silent ranges": the start and stop of each silent period. The negative image of that (the starts as stops and the stops as starts) gives you the periods of non-silence; a sketch of this inversion follows the quoted code below. Someone shows an example of using detect_silence here.

EDIT:
Here is the example code from the link (just in case that link goes down):

def test_realistic_audio(self):
    silent_ranges = detect_silence(self.seg4, min_silence_len=1000, silence_thresh=self.seg4.dBFS)

    prev_end = -1
    for start, end in silent_ranges:
        self.assertTrue(start > prev_end)
        prev_end = end
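
For completeness, here is a standalone sketch of the "negative image" idea described above. It is only an illustration, not the linked code: the file name and parameter values are placeholders you would replace with your own, and the threshold is set relative to the clip's dBFS.

from pydub import AudioSegment
from pydub.silence import detect_silence

#placeholder file name; use your own audio here
sound = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

#silent spans as [start_ms, end_ms] pairs
silent_ranges = detect_silence(sound, min_silence_len=500, silence_thresh=sound.dBFS - 16)

#invert the silent spans: the gaps between them are the speech spans
speech_ranges = []
prev_end = 0
for start, end in silent_ranges:
    if start > prev_end:
        speech_ranges.append([prev_end, start])
    prev_end = end
if prev_end < len(sound):
    speech_ranges.append([prev_end, len(sound)])

print("start,stop (ms)")
for start, end in speech_ranges:
    print(start, end)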
CognizantApe