I feel like this is a fairly common problem, but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to split on words. This can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in Python that does this automatically?

- You're looking for [`SpeechRecognition`](https://pypi.python.org/pypi/SpeechRecognition/), which explicitly has an example dedicated to [transcribing audio files](https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py). Next time, Google first :) – Akshat Mahajan Apr 06 '16 at 17:36
- I didn't ask for a function that can transcribe, but rather one that can split an audio file on the words, which, although perhaps implicit in transcription, is not the same thing. I'm familiar with the SpeechRecognition package. – user3059201 Apr 06 '16 at 20:55
- There are no boundaries between words in real speech; you say "how are you" as a single chunk without any acoustic cues. If you want to split on words, you need to transcribe. – Nikolay Shmyrev Apr 06 '16 at 21:09
- That's not really true. If you look at any speech waveform, it's obvious where the words/pauses are. – user3059201 Apr 06 '16 at 21:11
- "For most spoken languages, the boundaries between lexical units are difficult to identify... One might expect that the inter-word spaces used by many written languages... would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word." https://en.wikipedia.org/wiki/Speech_segmentation – Nikolay Shmyrev Apr 07 '16 at 12:03
5 Answers
An easier way to do this is with the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting the silence threshold and the silence length, and they simplify the code significantly compared to the other methods mentioned.
Here is a demo implementation; the inspiration came from here.
Setup:
I had an audio file with the spoken English letters from A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the audio was split into separate chunk files, each storing roughly one spoken letter.
Observations:
Some of the syllables were cut off, possibly requiring modification of the following parameters:
min_silence_len=500
silence_thresh=-16
One may want to tune these to one's own requirements.
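If chunks still come out clipped after tuning, one possible tweak (a sketch only, not part of the original answer) is to relax the thresholds and keep a little padding around each chunk via split_on_silence's keep_silence argument; the values below are purely illustrative:
# Sketch: a re-tuned version of the call used in the demo below
audio_chunks = split_on_silence(sound_file,
                                min_silence_len=300,   # a 300 ms pause already counts as a break
                                silence_thresh=-40,    # more permissive silence threshold (dBFS)
                                keep_silence=100)      # keep 100 ms of audio around each chunk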
Demo Code:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file,
                                # must be silent for at least half a second
                                min_silence_len=500,
                                # consider it silent if quieter than -16 dBFS
                                silence_thresh=-16
                                )

for i, chunk in enumerate(audio_chunks):
    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    print "exporting", out_file
    chunk.export(out_file, format="wav")
Output:
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>>
- Yup, I have been looking into this problem myself, but as 'pouya' mentioned, pydub or pyAudioAnalysis will only work if there is a massive gap between words, which will not be the case in any practical scenario. The problem also runs in the opposite direction, where some words may get broken into syllables if the speaker is not a native speaker and takes time to pronounce some words. – Deepak Agarwal Mar 22 '22 at 06:26
You could look at Audiolab. It provides a decent API to convert the voice samples into numpy arrays. The Audiolab module uses the libsndfile C library to do the heavy lifting.
You can then scan the arrays for low-amplitude stretches to find the pauses.
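As a rough illustration of that idea, here is a minimal sketch (not part of the original answer): it assumes the samples are already in a 1-D numpy array, however they were loaded, and the helper name and threshold values are made up for illustration.
import numpy as np

def find_pauses(samples, sample_rate, threshold_ratio=0.02, min_pause_s=0.2):
    # Treat samples whose amplitude stays below a fraction of the peak as "quiet"
    threshold = threshold_ratio * np.abs(samples).max()
    quiet = (np.abs(samples) < threshold).astype(np.int8)
    # Pad with zeros so every quiet run has both a rising and a falling edge
    edges = np.diff(np.concatenate(([0], quiet, [0])))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    min_len = int(min_pause_s * sample_rate)
    return [(s, e) for s, e in zip(starts, ends) if e - s >= min_len]
Splitting then amounts to cutting the waveform at, say, the midpoint of each returned pause.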

Use IBM STT. With timestamps=true you will get the word break-up along with the times at which the system detects each word to have been spoken.
There are a lot of other useful features, like word_alternatives_threshold to get other candidate words and word_confidence to get the confidence with which the system predicts each word. Set word_alternatives_threshold to somewhere between 0.1 and 0.01 to get a real idea.
This needs a sign-up, after which you can use the generated username and password.
The IBM STT is already part of the speechrecognition module mentioned above, but to get the word timestamps you will need to modify the function.
An extracted and modified form looks like:
import base64
import json
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

import speech_recognition as sr

# IBM_USERNAME and IBM_PASSWORD are the credentials generated after signing up
def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2         # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

    # return results
    if show_all:
        return result
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
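For completeness, a hypothetical usage sketch (not from the original answer): load a file with the speechrecognition package, call the modified function with show_all=True and timestamps=True, and read the per-word timestamps out of the raw Watson JSON. The file name is a placeholder and the exact response layout may vary by service version.
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:   # placeholder file name
    audio_data = r.record(source)

result = extracted_from_sr_recognize_ibm(audio_data, show_all=True, timestamps=True)
# Watson typically reports timestamps as [word, start_seconds, end_seconds] triples
for res in result.get("results", []):
    for word, start, end in res.get("alternatives", [{}])[0].get("timestamps", []):
        print("{:>12s}  {:6.2f} to {:6.2f} s".format(word, start, end))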

pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:
python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3
More details on my blog.
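If you prefer to stay inside Python rather than shelling out, a hedged sketch of the roughly equivalent library call follows; the function names and argument order are assumptions that vary between pyAudioAnalysis releases (older versions expose readAudioFile and silenceRemoval instead):
# Sketch only: names assume a recent pyAudioAnalysis release
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

fs, x = aIO.read_audio_file("SPEECH_AUDIO_FILE_TO_SPLIT.mp3")
segments = aS.silence_removal(x, fs, 0.020, 0.020, smooth_window=1.0, weight=0.3)
print(segments)  # list of [start_seconds, end_seconds] for the detected non-silent regions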

My variant of the function, which will probably be easier to modify for your needs:
from scipy.io.wavfile import write as write_wav
import numpy as np
import librosa


def zero_runs(a):
    # Return (start, end) index pairs for runs of consecutive zeros in `a`
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges


def split_in_parts(audio_path, out_dir):
    # Some constants
    min_length_for_silence = 0.01  # seconds
    percentage_for_silence = 0.01  # eps value for silence
    required_length_of_chunk_in_seconds = 60  # Chunk will be around this value, not exact
    sample_rate = 16000  # Set to None to use the file's default

    # Load audio
    waveform, sampling_rate = librosa.load(audio_path, sr=sample_rate)

    # Create mask of silence (1 where the signal is below eps)
    eps = waveform.max() * percentage_for_silence
    silence_mask = (np.abs(waveform) < eps).astype(np.uint8)

    # Find where silence starts and ends; zero_runs finds runs of zeros,
    # so pass the inverted mask to get the silent ranges
    runs = zero_runs(1 - silence_mask)
    lengths = runs[:, 1] - runs[:, 0]

    # Keep only large silence ranges
    min_length_for_silence = min_length_for_silence * sampling_rate
    large_runs = runs[lengths > min_length_for_silence]

    # Mark only the center of each silence
    silence_mask[...] = 0
    for start, end in large_runs:
        center = (start + end) // 2
        silence_mask[center] = 1

    min_required_length = required_length_of_chunk_in_seconds * sampling_rate
    chunks = []
    prev_pos = 0
    for i in range(min_required_length, len(waveform), min_required_length):
        start = i
        end = i + min_required_length
        # Cut at the first marked silence center inside this window
        next_pos = start + silence_mask[start:end].argmax()
        part = waveform[prev_pos:next_pos].copy()
        prev_pos = next_pos
        if len(part) > 0:
            chunks.append(part)

    # Add the last part of the waveform
    part = waveform[prev_pos:].copy()
    chunks.append(part)
    print('Total chunks: {}'.format(len(chunks)))

    new_files = []
    for i, chunk in enumerate(chunks):
        out_file = out_dir + "chunk_{}.wav".format(i)
        print("exporting", out_file)
        write_wav(out_file, sampling_rate, chunk)
        new_files.append(out_file)

    return new_files
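A hypothetical call, assuming the output directory already exists and ends with a path separator (the function concatenates it directly):
files = split_in_parts("long_recording.wav", "chunks/")  # placeholder paths
print(files)  # e.g. ['chunks/chunk_0.wav', 'chunks/chunk_1.wav', ...]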
