I feel like this is a fairly common problem, but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to split on words. This can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in Python that does this automatically?

- You're looking for [`SpeechRecognition`](https://pypi.python.org/pypi/SpeechRecognition/), which explicitly has an example dedicated to [transcribing audio files](https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py). Next time, Google first :) – Akshat Mahajan Apr 06 '16 at 17:36
- I didn't ask for a function that can transcribe, but rather one that can split an audio file on the words, which, although perhaps implicit in transcription, is not the same thing. I'm familiar with the SpeechRecognition package. – user3059201 Apr 06 '16 at 20:55
- There are no boundaries between words in real speech; you say "how are you" as a single chunk without any acoustic cues. If you want to split on words, you need to transcribe. – Nikolay Shmyrev Apr 06 '16 at 21:09
- That's not really true. If you look at any speech waveform, it's obvious where the words/pauses are. – user3059201 Apr 06 '16 at 21:11
- "For most spoken languages, the boundaries between lexical units are difficult to identify... One might expect that the inter-word spaces used by many written languages... would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word." https://en.wikipedia.org/wiki/Speech_segmentation – Nikolay Shmyrev Apr 07 '16 at 12:03
5 Answers
An easier way to do this is with the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting the silence threshold and the silence length, and they simplify the code significantly compared to the other methods mentioned.
Here is a demo implementation; the inspiration came from here.
Setup:
I had an audio file with the spoken English letters from A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the audio was split into separate chunk files, each storing roughly one spoken letter.
Observations:
Some of the syllables were cut off, possibly requiring modification of the following parameters:
min_silence_len=500
silence_thresh=-16
One may want to tune these to one's own requirements.
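If chunks still come out clipped after tuning, one possible tweak (a sketch only, not part of the original answer) is to relax the thresholds and keep a little padding around each chunk via split_on_silence's keep_silence argument; the values below are purely illustrative:
# Sketch: a re-tuned version of the call used in the demo below
audio_chunks = split_on_silence(sound_file,
                                min_silence_len=300,   # a 300 ms pause already counts as a break
                                silence_thresh=-40,    # more permissive silence threshold (dBFS)
                                keep_silence=100)      # keep 100 ms of audio around each chunk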
Demo Code:
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file,
                                # must be silent for at least half a second
                                min_silence_len=500,
                                # consider it silent if quieter than -16 dBFS
                                silence_thresh=-16
                                )

for i, chunk in enumerate(audio_chunks):
    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    print "exporting", out_file
    chunk.export(out_file, format="wav")
Output:
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>>
- Yup, I have been looking into this problem myself, but as 'pouya' mentioned, pydub or pyAudioAnalysis will only work if there is a massive gap between words, which will not be the case in any practical scenario. The problem also runs in the opposite direction, where some words may get broken into syllables if the speaker is not a native speaker and takes time to pronounce some words. – Deepak Agarwal Mar 22 '22 at 06:26
You could look at Audiolab. It provides a decent API to convert the voice samples into numpy arrays. The Audiolab module uses the libsndfile C library to do the heavy lifting.
You can then scan the arrays for low-amplitude stretches to find the pauses.
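As a rough illustration of that idea, here is a minimal sketch (not part of the original answer): it assumes the samples are already in a 1-D numpy array, however they were loaded, and the helper name and threshold values are made up for illustration.
import numpy as np

def find_pauses(samples, sample_rate, threshold_ratio=0.02, min_pause_s=0.2):
    # Treat samples whose amplitude stays below a fraction of the peak as "quiet"
    threshold = threshold_ratio * np.abs(samples).max()
    quiet = (np.abs(samples) < threshold).astype(np.int8)
    # Pad with zeros so every quiet run has both a rising and a falling edge
    edges = np.diff(np.concatenate(([0], quiet, [0])))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    min_len = int(min_pause_s * sample_rate)
    return [(s, e) for s, e in zip(starts, ends) if e - s >= min_len]
Splitting then amounts to cutting the waveform at, say, the midpoint of each returned pause.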

Use IBM STT. With timestamps=true you will get the word break-up along with the times at which the system detects each word to have been spoken.
There are a lot of other useful features, like word_alternatives_threshold to get other candidate words and word_confidence to get the confidence with which the system predicts each word. Set word_alternatives_threshold to somewhere between 0.1 and 0.01 to get a real idea.
This needs a sign-up, after which you can use the generated username and password.
The IBM STT is already part of the speechrecognition module mentioned above, but to get the word timestamps you will need to modify the function.
An extracted and modified form looks like:
import base64
import json
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

import speech_recognition as sr

# IBM_USERNAME and IBM_PASSWORD are the credentials generated after signing up
def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2         # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

    # return results
    if show_all:
        return result
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
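For completeness, a hypothetical usage sketch (not from the original answer): load a file with the speechrecognition package, call the modified function with show_all=True and timestamps=True, and read the per-word timestamps out of the raw Watson JSON. The file name is a placeholder and the exact response layout may vary by service version.
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:   # placeholder file name
    audio_data = r.record(source)

result = extracted_from_sr_recognize_ibm(audio_data, show_all=True, timestamps=True)
# Watson typically reports timestamps as [word, start_seconds, end_seconds] triples
for res in result.get("results", []):
    for word, start, end in res.get("alternatives", [{}])[0].get("timestamps", []):
        print("{:>12s}  {:6.2f} to {:6.2f} s".format(word, start, end))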

pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:
python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3
More details on my blog.
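If you prefer to stay inside Python rather than shelling out, a hedged sketch of the roughly equivalent library call follows; the function names and argument order are assumptions that vary between pyAudioAnalysis releases (older versions expose readAudioFile and silenceRemoval instead):
# Sketch only: names assume a recent pyAudioAnalysis release
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

fs, x = aIO.read_audio_file("SPEECH_AUDIO_FILE_TO_SPLIT.mp3")
segments = aS.silence_removal(x, fs, 0.020, 0.020, smooth_window=1.0, weight=0.3)
print(segments)  # list of [start_seconds, end_seconds] for the detected non-silent regions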

My variant of the function, which will probably be easier to modify for your needs:
from scipy.io.wavfile import write as write_wav
import numpy as np
import librosa


def zero_runs(a):
    # Return (start, end) index pairs for runs of consecutive zeros in `a`
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges


def split_in_parts(audio_path, out_dir):
    # Some constants
    min_length_for_silence = 0.01  # seconds
    percentage_for_silence = 0.01  # eps value for silence
    required_length_of_chunk_in_seconds = 60  # Chunk will be around this value, not exact
    sample_rate = 16000  # Set to None to use the file's default

    # Load audio
    waveform, sampling_rate = librosa.load(audio_path, sr=sample_rate)

    # Create mask of silence (1 where the signal is below eps)
    eps = waveform.max() * percentage_for_silence
    silence_mask = (np.abs(waveform) < eps).astype(np.uint8)

    # Find where silence starts and ends; zero_runs finds runs of zeros,
    # so pass the inverted mask to get the silent ranges
    runs = zero_runs(1 - silence_mask)
    lengths = runs[:, 1] - runs[:, 0]

    # Keep only large silence ranges
    min_length_for_silence = min_length_for_silence * sampling_rate
    large_runs = runs[lengths > min_length_for_silence]

    # Mark only the center of each silence
    silence_mask[...] = 0
    for start, end in large_runs:
        center = (start + end) // 2
        silence_mask[center] = 1

    min_required_length = required_length_of_chunk_in_seconds * sampling_rate
    chunks = []
    prev_pos = 0
    for i in range(min_required_length, len(waveform), min_required_length):
        start = i
        end = i + min_required_length
        # Cut at the first marked silence center inside this window
        next_pos = start + silence_mask[start:end].argmax()
        part = waveform[prev_pos:next_pos].copy()
        prev_pos = next_pos
        if len(part) > 0:
            chunks.append(part)

    # Add the last part of the waveform
    part = waveform[prev_pos:].copy()
    chunks.append(part)
    print('Total chunks: {}'.format(len(chunks)))

    new_files = []
    for i, chunk in enumerate(chunks):
        out_file = out_dir + "chunk_{}.wav".format(i)
        print("exporting", out_file)
        write_wav(out_file, sampling_rate, chunk)
        new_files.append(out_file)

    return new_files
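A hypothetical call, assuming the output directory already exists and ends with a path separator (the function concatenates it directly):
files = split_in_parts("long_recording.wav", "chunks/")  # placeholder paths
print(files)  # e.g. ['chunks/chunk_0.wav', 'chunks/chunk_1.wav', ...]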
