Ideally, applying ML algorithms with comprehensive train/test data would yield a dynamic solution that needs no manual tuning of the silence length and threshold values.
However, a simple static solution can be devised using pydub's detect_nonsilent function. It returns the start and stop times (in ms) of each non-silent chunk, in order.
The following parameters affect the result and may need some tuning:
min_silence_len : minimum length of silence, in ms, that you expect in the audio.
silence_thresh : anything below this threshold (in dBFS) is considered silence.
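To make the interplay of the two parameters concrete, here is a minimal pure-Python sketch. It is not pydub's implementation (pydub works on the RMS level of the audio itself); the function name nonsilent_ranges and the one-value-per-millisecond input list are assumptions made only for illustration.

```python
def nonsilent_ranges(levels_db, min_silence_len, silence_thresh):
    """Return [start, end) index ranges that are not part of a long silence.

    levels_db: hypothetical input, one loudness value (dBFS) per millisecond.
    A run of values below silence_thresh counts as silence only if it lasts
    at least min_silence_len entries; shorter dips stay inside the chunk.
    """
    # Step 1: collect silent runs that are long enough to count.
    silences = []
    run_start = None
    # The +inf sentinel closes a silent run that reaches the end of the list.
    for i, level in enumerate(levels_db + [float("inf")]):
        if level < silence_thresh:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_silence_len:
                silences.append((run_start, i))
            run_start = None

    # Step 2: the non-silent ranges are the gaps between those silences.
    ranges, cursor = [], 0
    for start, end in silences:
        if start > cursor:
            ranges.append([cursor, start])
        cursor = end
    if cursor < len(levels_db):
        ranges.append([cursor, len(levels_db)])
    return ranges


# 5 ms quiet, 4 ms loud, 2 ms quiet dip, 3 ms loud, 6 ms quiet
levels = [-50] * 5 + [-10] * 4 + [-50] * 2 + [-10] * 3 + [-50] * 6
print(nonsilent_ranges(levels, min_silence_len=3, silence_thresh=-20))  # [[5, 14]]
print(nonsilent_ranges(levels, min_silence_len=2, silence_thresh=-20))  # [[5, 9], [11, 14]]
```

With min_silence_len=3 the 2 ms dip is too short to count as silence, so the two loud parts are reported as one chunk; lowering it to 2 splits them.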
While experimenting, I noticed that it helps a lot to normalize the audio before running it through detect_nonsilent, probably because the applied gain brings the audio to a known average amplitude level, which makes detecting silence much easier.
The sample audio file was downloaded from the Open Speech Repository. Each audio file has 10 spoken sentences with some gap in between.
Here is a working demo:
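The arithmetic behind that observation can be sketched without any audio library. The sample values below are made up, and the helpers dbfs and apply_gain are illustrative stand-ins written for this example, not pydub calls.

```python
import math


def dbfs(samples, full_scale=1.0):
    """RMS level of the samples in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / full_scale)


def apply_gain(samples, gain_db):
    """Scale the samples by a gain given in dB."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


# Hypothetical quiet recording: 100 samples of "speech", then 100 of near-silence.
quiet = [0.01, -0.01] * 50 + [0.0001, -0.0001] * 50

# Same idea as normalizing to a target level: gain = target - current level.
target_dbfs = -20.0
normalized = apply_gain(quiet, target_dbfs - dbfs(quiet))

print("speech before:", round(dbfs(quiet[:100]), 2))
print("speech after: ", round(dbfs(normalized[:100]), 2))
print("gap after:    ", round(dbfs(normalized[100:]), 2))
```

Before the gain is applied, the speech part sits well below a -20 dBFS threshold and would be treated as silence. After normalization the speech rises above the threshold while the gap stays below it.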
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

# Adjust the sound to the target average amplitude
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

# Load the wav file as an AudioSegment
audio_segment = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

# Normalize audio_segment to -20 dBFS
normalized_sound = match_target_amplitude(audio_segment, -20.0)
print("length of audio_segment={} seconds".format(len(normalized_sound)/1000))

# Detect non-silent chunks, which in our case are the spoken sentences.
# Each entry is a [start, stop] pair in ms.
nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=500, silence_thresh=-20, seek_step=1)

# Print each chunk, converting ms to seconds
print("start,Stop")
for chunks in nonsilent_data:
    print([chunk/1000 for chunk in chunks])
Result:
root# python nonSilence.py
length of audio_segment=33.623 seconds
start,Stop
[0.81, 2.429]
[4.456, 5.137]
[8.084, 8.668]
[11.035, 12.334]
[14.387, 15.601]
[17.594, 18.133]
[20.733, 21.289]
[24.007, 24.066]
[27.372, 27.977]
[30.361, 30.996]
Compared with the boundaries seen in Audacity (comparison below), the detected start times are off by up to about 0.4 s, while the detected end times fall short by more (up to about 1.6 s). Tuning the detect_nonsilent arguments may help.
Count  From Script (s)   From Audacity (s)
1      0.81-2.429        0.573-2.833
2      4.456-5.137       4.283-6.421
3      8.084-8.668       7.824-9.679
4      11.035-12.334     10.994-12.833
5      14.387-15.601     14.367-16.120
6      17.594-18.133     17.3-19.021
7      20.733-21.289     20.471-22.258
8      24.007-24.066     23.843-25.664
9      27.372-27.977     27.081-28.598
10     30.361-30.996     30.015-32.240
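To quantify the comparison, the offsets can be computed directly from the numbers above. This is a small sketch that only re-uses the values already listed. The variable names are chosen for illustration, and row 7 uses the start time printed by the script (20.733).

```python
# (script_start, script_end, audacity_start, audacity_end), all in seconds
rows = [
    (0.81, 2.429, 0.573, 2.833),
    (4.456, 5.137, 4.283, 6.421),
    (8.084, 8.668, 7.824, 9.679),
    (11.035, 12.334, 10.994, 12.833),
    (14.387, 15.601, 14.367, 16.120),
    (17.594, 18.133, 17.3, 19.021),
    (20.733, 21.289, 20.471, 22.258),
    (24.007, 24.066, 23.843, 25.664),
    (27.372, 27.977, 27.081, 28.598),
    (30.361, 30.996, 30.015, 32.240),
]

# How much later the script starts, and how much earlier it stops, than Audacity
start_offsets = [round(s_start - a_start, 3) for s_start, _, a_start, _ in rows]
end_offsets = [round(a_end - s_end, 3) for _, s_end, _, a_end in rows]

print("start offsets (s):", start_offsets)
print("end offsets (s):  ", end_offsets)
print("largest start offset:", max(start_offsets))
print("largest end offset:  ", max(end_offsets))
```

The start offsets all stay below 0.35 s, while the end offsets range from about 0.4 s to 1.6 s, so the script tends to cut the end of each sentence short.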
