Adding a pause in Google-text-to-speech

Question

I am looking for a way to insert a short pause (roughly 2 seconds; configurable would be ideal) when speaking out the desired text.

People online have said that adding three full stops followed by a space creates a break, but I don't seem to be getting that. The code below is my test, which sadly produces no pauses. Any ideas or suggestions?

Edit: It would be ideal if there is some command from gTTS that would allow me to do this, or maybe some trick like using the three full stops if that actually worked.

from gtts import gTTS
import os

tts = gTTS(text=" Testing ... if there is a pause ... ... ... ... ...  longer pause? ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... insane pause   " , lang='en', slow=False)

tts.save("temp.mp3")
os.system("temp.mp3")

Peyman Majidi · Answer 1 · 2020-06-29T04:43:16.573

OK, you need Speech Synthesis Markup Language (SSML) to achieve this.
Be aware that you need to set up Google Cloud Platform credentials.

First, in bash:

pip install --upgrade google-cloud-texttospeech

Then here is the code:

import html
from google.cloud import texttospeech

def ssml_to_audio(ssml_text, outfile):
    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)

    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file " + outfile)

def text_to_ssml(inputfile):

    raw_lines = inputfile

    # Replace special characters with HTML Ampersand Character Codes
    # These Codes prevent the API from confusing text with
    # SSML commands
    # For example, '<' --> '&lt;' and '&' --> '&amp;'

    escaped_lines = html.escape(raw_lines)

    # Convert plaintext to SSML
    # Insert a two-second break at each line break
    ssml = "<speak>{}</speak>".format(
        escaped_lines.replace("\n", '\n<break time="2s"/>')
    )

    # Return the concatenated string of ssml script
    return ssml



# Plain text only: text_to_ssml() escapes markup, so SSML tags written here
# would be treated as literal text. A 2 s break is inserted at each newline.
text = """Here are SSML samples.
I can pause.
I can play a sound."""

ssml = text_to_ssml(text)
ssml_to_audio(ssml, "test.mp3")
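
Because text_to_ssml() escapes everything it receives, any SSML tags already present in the input are escaped too. If the text arrives as separate phrases, one option is to escape only the phrase text and insert the break tags afterwards. A minimal sketch, assuming that layout; phrases_to_ssml and its pause parameter are hypothetical names, not part of the Google client library:

```python
import html


def phrases_to_ssml(phrases, pause="2s"):
    """Join plain-text phrases into one SSML string with a break between them.

    Hypothetical helper: only the phrase text is escaped, so the <break>
    tags inserted here stay intact.
    """
    separator = '<break time="{}"/>'.format(pause)
    escaped = [html.escape(phrase.strip()) for phrase in phrases]
    return "<speak>{}</speak>".format(separator.join(escaped))


ssml = phrases_to_ssml(["Testing", "if there is a pause", "Tom & Jerry"], pause="2s")
print(ssml)
```

The resulting string can be passed to ssml_to_audio() in place of the output of text_to_ssml().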

More documentation:
Speaking addresses with SSML

If you don't have Google Cloud Platform credentials, a cheaper and easier option is to use time.sleep(1).

I'm on google-cloud-texttospeech v2.3.0, and escaping the HTML characters is no longer required. If you escape the characters, the synthesizer will read them out loud. – Arion_Miles, May 09 '21 at 15:35

score 3 · Answer 2 · answered Jan 20 '20 at 09:15

If a background wait is required, you can use the time module to wait, as below.

import time
# SLEEP FOR 5 SECONDS AND START THE PROCESS
time.sleep(5)

Or you can check up to three times, waiting between attempts:

import time

# someprocess() is a placeholder for your own check
for tries in range(3):
    if someprocess() is False:
        time.sleep(3)

High-Octane

Thanks for this, but I was looking for a solution within Google text-to-speech itself, something I could add on the same line as the text to be spoken. sleep and time are already being used, and this might conflict with the original code. Will give this a try though. – Barry Sturgeon Jan 20 '20 at 09:54
It is obvious that text-to-speech must offer something like this. Pacing is everything in speech, so being able to add pauses is very basic. – Markus Bawidamann Feb 16 '21 at 14:02

score 1 · Answer 3 · answered Jun 29 '20 at 03:44

You can save multiple mp3 files, then play them one after another, using time.sleep() between them for the pause you want:

from gtts import gTTS
import os
from time import sleep

tts1 = gTTS(text="Testing" , lang='en', slow=False)
tts2 = gTTS(text="if there is a pause" , lang='en', slow=False)
tts3 = gTTS(text="insane pause   " , lang='en', slow=False)

tts1.save("temp1.mp3")
tts2.save("temp2.mp3")
tts3.save("temp3.mp3")

os.system("temp1.mp3")
sleep(2)
os.system("temp2.mp3")
sleep(3)
os.system("temp3.mp3")

score 1 · Answer 4 · answered Jun 29 '20 at 05:14

Sadly, the answer is no: the gTTS package has no dedicated function for pauses. An issue requesting a pause function was already opened in 2018. However, gTTS is smart enough to add natural pauses through its tokenizer.

What is a tokenizer?

A function that takes text and returns it split into a list of tokens (strings). In the gTTS context, its goal is to cut the text into smaller segments that do not exceed the maximum character size allowed (100) for each TTS API request, while making the speech sound natural and continuous. It does so by splitting text where speech would naturally pause (for example on ".") while handling where it should not (for example on “10.5” or “U.S.A.”). Such rules are called tokenizer cases, which it takes a list of.

Here is an example:

text = "regular text speed no pause regular text speed comma pause, regular text speed period pause. regular text speed exclamation pause! regular text speed ellipses pause... regular text speed new line pause \n regular text speed "

So in this case, adding a sleep() seems like the only answer. But tricking the tokenizer is worth mentioning.
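
To make the splitting idea concrete, here is a simplified sketch of the behaviour described above. It is not the gTTS implementation: split_for_tts is a hypothetical name, the rule set is reduced to a single regular expression, and abbreviations such as “U.S.A.” are not handled.

```python
import re

MAX_CHARS = 100  # per-request limit quoted above


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text where speech would naturally pause, keeping chunks short."""
    # Split after . ! ? or , only when whitespace follows, so "10.5" stays whole.
    pieces = [p.strip() for p in re.split(r"(?<=[.!?,])\s+", text) if p.strip()]
    chunks = []
    for piece in pieces:
        # Hard-wrap any piece that is still longer than the limit.
        while len(piece) > max_chars:
            cut = piece.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space to break on, cut mid-word
            chunks.append(piece[:cut].strip())
            piece = piece[cut:].strip()
        if piece:
            chunks.append(piece)
    return chunks


for chunk in split_for_tts("Comma pause, period pause. The value is 10.5 today!"):
    print(repr(chunk))
```

Each chunk would be sent as its own request, which is why punctuation affects where the speech pauses.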

This is not the case anymore here in 2022, and I'm not sure whether this was unavailable back in 2020. The correct answer is to use SSML tags to manipulate the output. The answer above with code examples shows how to do this. Simply adjust the SynthesisInput object passed into the TTS client (synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)) and wrap your text in <speak>...</speak>. Use the SSML tags as shown in the documentation. There are many options, but in this case the original question wants a <break> tag. – akuma6099, Mar 07 '22 at 14:31

4Rom1 · Answer 5 · 2022-08-01T20:32:04.937

You can add an arbitrary pause with Pydub by saving temporary mp3 files and concatenating them, using a silent audio clip for the pause. You can use any symbol of your choice to mark where a pause should go (here $):

from pydub import AudioSegment
from gtts import gTTS

contents = "Hello with $$ 2 seconds pause"
parts = contents.split("$")  # "$" is the symbol chosen to mark a pause
pause2s = AudioSegment.from_mp3("silent.mp3")  # silent.mp3 contains 2 s of silence
combined = AudioSegment.empty()  # empty track to append to

for cnt, p in enumerate(parts):
    # The empty element between two "$" markers becomes a pause
    if not p:
        combined += pause2s
    else:
        tts = gTTS(text=p, lang="en", slow=False)
        tmpFileName = "tmp" + str(cnt) + ".mp3"
        tts.save(tmpFileName)
        combined += AudioSegment.from_mp3(tmpFileName)

combined.export("out.mp3", format="mp3")
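
The split logic can be checked without generating any audio. The sketch below turns the marked-up text into a list of steps; plan_segments and its parameters are hypothetical names, and unlike the loop above it treats whitespace-only parts as pauses and merges consecutive pauses into one.

```python
def plan_segments(contents, marker="$", pause_ms=2000):
    """Turn marker-delimited text into ("speak", text) / ("pause", ms) steps."""
    steps = []
    for part in contents.split(marker):
        if not part.strip():
            # Empty element between two markers: one pause.
            if steps and steps[-1][0] == "pause":
                # Merge with the previous pause instead of adding a new step.
                steps[-1] = ("pause", steps[-1][1] + pause_ms)
            else:
                steps.append(("pause", pause_ms))
        else:
            steps.append(("speak", part.strip()))
    return steps


for step in plan_segments("Hello with $$ 2 seconds pause"):
    print(step)
```

Note that a single marker only splits the text; it takes two adjacent markers to produce an empty element and therefore a pause.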

score 0 · Answer 6 · answered Nov 01 '21 at 05:50

Late to the party here, but you might consider trying out the audio_program_generator package. You provide a text file comprised of individual phrases, each of which has a configurable pause at the end. In return, it gives you an mp3 file that 'stitches together' all the phrases and their pauses into one continuous audio file. You can optionally mix in a background sound-file, as well. And it implements several of the other bells and whistles that Google TTS provides, like accents, slow-play-speech, etc.

Disclaimer: I am the author of the package.

jproberts · Answer 7 · 2022-10-28T16:42:16.153

I had the same problem, and didn't want to use lots of temporary files on disk. This code parses an SSML file, and creates silence whenever a <break> tag is found:

import io
from gtts import gTTS

import lxml.etree as etree
import pydub

ssml_filename = 'Section12.35-edited.ssml'
wav_filename = 'Section12.35-edited.mp3'

events = ('end',)
DEFAULT_BREAK_TIME = 250

all_audio = pydub.AudioSegment.silent(100)

for event, element in etree.iterparse(
                                      ssml_filename,
                                      events=events,
                                      remove_comments=True,
                                      remove_pis=True,
                                      attribute_defaults=True,
                                     ):
    tag = etree.QName(element).localname
    if tag in ['p', 's'] and element.text:
        tts = gTTS(element.text, lang='en', tld='com.au')

        with io.BytesIO() as temp_bytes:
            tts.write_to_fp(temp_bytes)
            temp_bytes.seek(0)

            audio = pydub.AudioSegment.from_mp3(temp_bytes)
            all_audio = all_audio.append(audio)
    elif tag == 'break':
        # write silence to the file.
        time = element.attrib.get('time', None)  # Shouldn't be possible to have no time value.
        if time:
            if time.endswith('ms'):
                time_value = int(float(time.removesuffix('ms')))
            elif time.endswith('s'):
                # float() so that fractional values such as "1.5s" are accepted
                time_value = int(float(time.removesuffix('s')) * 1000)
            else:
                time_value = DEFAULT_BREAK_TIME
        else:
            time_value = DEFAULT_BREAK_TIME
        silence = pydub.AudioSegment.silent(time_value)
        all_audio = all_audio.append(silence)

with open(wav_filename, 'wb') as output_file:
    all_audio.export(output_file, format='mp3')
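
One case the loop above does not cover is text that follows a <break> inside the same <s> or <p>, since only element.text is read. The sketch below shows one way to walk the markup in document order. It is an illustration only: it uses the standard library's xml.etree.ElementTree rather than lxml, produces a list of steps instead of audio, and ssml_to_steps and break_ms are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET


def break_ms(value, default_ms=250):
    """Convert an SSML time such as "2s", "1.5s" or "250ms" to milliseconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)(ms|s)\s*", value or "")
    if match is None:
        return default_ms  # missing or unrecognised value
    number, unit = match.groups()
    return round(float(number) * (1 if unit == "ms" else 1000))


def ssml_to_steps(ssml, default_ms=250):
    """Flatten SSML into ("speak", text) and ("pause", ms) steps, in order."""
    steps = []

    def add_text(text):
        if text and text.strip():
            steps.append(("speak", " ".join(text.split())))

    def walk(element):
        tag = element.tag.rsplit("}", 1)[-1]  # drop any XML namespace
        if tag == "break":
            steps.append(("pause", break_ms(element.get("time"), default_ms)))
            return
        add_text(element.text)
        for child in element:
            walk(child)
            add_text(child.tail)  # text that follows the child in its parent

    walk(ET.fromstring(ssml))
    return steps


demo = '<speak><s>Hello <break time="1.5s"/> world</s></speak>'
for step in ssml_to_steps(demo):
    print(step)
```

Each "speak" step would then go through gTTS and each "pause" step through pydub.AudioSegment.silent(), as in the loop above.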

score 0 · Answer 8 · answered Feb 18 '23 at 06:38

I know 4Rom1 used this method above, but to put it more simply, this worked really well for me: get a 1-second silent mp3 (I found one by searching for "1 sec silent mp3"), then use pydub to add the silence segment as many times as you need. For example, to add 3 seconds of silence:

from pydub import AudioSegment
seconds = 3
output = AudioSegment.from_file("yourfile.mp3")
output += AudioSegment.from_file("1sec_silence.mp3") * seconds
output.export("newaudio.mp3", format="mp3")
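
If you would rather not download a silent file, a silent clip can also be generated with the standard library. This is a sketch that swaps the mp3 for a WAV; silence_wav_bytes is a hypothetical helper name, and it assumes your pydub setup can load WAV data, for example via AudioSegment.from_file(..., format="wav").

```python
import io
import wave


def silence_wav_bytes(seconds=1.0, frame_rate=44100):
    """Return the bytes of a mono 16-bit WAV file containing only silence."""
    n_frames = int(seconds * frame_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(frame_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)  # all-zero samples
    return buffer.getvalue()


data = silence_wav_bytes(seconds=1.0)
print(len(data), "bytes of WAV data")
```

Writing data to a file such as 1sec_silence.wav gives a clip that can be used in place of the downloaded one. pydub's own AudioSegment.silent(), used in jproberts' answer above, is another way to get silence without any file.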
