I'm trying to prepare some audio files for my neural network. To remove the silent parts I used a solution a user posted here on SO:
def detect_leading_silence(sound, silence_threshold=-50.0, chunk_size=10):
    '''
    sound is a pydub.AudioSegment
    silence_threshold in dBFS
    chunk_size in ms

    iterate over chunks until you find the first one with sound
    '''
    trim_ms = 0  # ms
    assert chunk_size > 0  # to avoid an infinite loop
    while trim_ms < len(sound) and sound[trim_ms:trim_ms + chunk_size].dBFS < silence_threshold:
        trim_ms += chunk_size
    return trim_ms
This works well, but not perfectly. When I plot the audio signal, I can see that there is still some noise/silence at the beginning. It is also not really dynamic, since the threshold is a fixed value, while some other audio files need a higher or lower threshold.
How can I extract only the voice/speech?