How to get SSML timestamps from Google Cloud text-to-speech API

Question

I want to use SSML markers through the Google Cloud text-to-speech API to request the timing of these markers in the audio stream. These timestamps are necessary in order to provide cues for effects, word/section highlighting and feedback to the user.

I found this question which is relevant, although the question refers to the timestamps for each word and not the SSML <mark> tag.

The following API request returns OK, but the response does not include the requested marker data. This is using the Cloud Text-to-Speech API v1.

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "mp3"
 }
}

Response:

{
 "audioContent":"//NExAAAAANIAAAAABcFAThYGJqMWA..."
}

The response contains only the synthesized audio, without any contextual information.

Is there an API request that I am overlooking which can expose information about these markers such as is the case with IBM Watson and Amazon Polly?

Did you find a solution for this? Looks like Google's api doesn't support speech marks. Correct? — Bret, Jul 09 '20 at 18:50

score 4 · Answer 1 · answered Oct 09 '21 at 03:31

At the time of writing, timepoint data is available in the v1beta1 release of Google Cloud Text-to-Speech.

I didn't need to sign on to any extra developer program in order to access the beta, beyond the default access.

Importing in Python (for example) went from:

from google.cloud import texttospeech as tts

to:

from google.cloud import texttospeech_v1beta1 as tts

Nice and simple.

I needed to modify the default way I was sending the synthesis request to include the enable_time_pointing flag.

I found that with a mix of poking around the machine-readable API description here and reading the Python library code, which I had already downloaded.

Thankfully, the source in the generally available version also includes the v1beta1 version - thank you Google!

I've put a runnable sample below. Running this needs the same auth and setup you'll need for a general text-to-speech sample, which you can get by following the official documentation.

Here's what it does for me (with slight formatting for readability):

$ python tools/try-marks.py
Marks content written to file: .../demo.json
Audio content written to file: .../demo.mp3

$ cat demo.json
[
  {"sec": 0.4300000071525574, "name": "here"},
  {"sec": 0.9234582781791687, "name": "there"}
]
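
Since the question asks for timestamps in order to drive cues and highlighting, here is a minimal sketch of how a marks list like the demo.json output above could be used for that. The function name active_mark and the playback times are hypothetical and chosen for illustration only; the sketch assumes each entry has the "sec" and "name" keys shown above.

```python
import json
from bisect import bisect_right


def active_mark(marks, playback_sec):
    """Return the name of the latest mark at or before playback_sec, or None.

    Hypothetical helper: assumes each mark is a dict with "sec" and "name" keys.
    """
    # Sort defensively so the binary search is valid even if the input is unordered.
    ordered = sorted(marks, key=lambda m: m["sec"])
    times = [m["sec"] for m in ordered]
    i = bisect_right(times, playback_sec)
    return ordered[i - 1]["name"] if i else None


# The marks as written to demo.json by the sample.
marks = json.loads("""
[
  {"sec": 0.4300000071525574, "name": "here"},
  {"sec": 0.9234582781791687, "name": "there"}
]
""")

# Illustrative playback positions, in seconds.
for t in (0.1, 0.5, 1.2):
    print(t, active_mark(marks, t))
```

A player could call something like this on each time update and highlight the section that belongs to the returned mark name.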

Here's the sample:

import json
from pathlib import Path
from google.cloud import texttospeech_v1beta1 as tts


def go_ssml(basename: Path, ssml):
    client = tts.TextToSpeechClient()
    voice = tts.VoiceSelectionParams(
        language_code="en-AU",
        name="en-AU-Wavenet-B",
        ssml_gender=tts.SsmlVoiceGender.MALE,
    )

    response = client.synthesize_speech(
        request=tts.SynthesizeSpeechRequest(
            input=tts.SynthesisInput(ssml=ssml),
            voice=voice,
            audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
            enable_time_pointing=[
                tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK]
        )
    )

    # cheesy conversion of array of Timepoint proto.Message objects into plain-old data
    marks = [dict(sec=t.time_seconds, name=t.mark_name)
             for t in response.timepoints]

    name = basename.with_suffix('.json')
    with name.open('w') as out:
        json.dump(marks, out)
        print(f'Marks content written to file: {name}')

    name = basename.with_suffix('.mp3')
    with name.open('wb') as out:
        out.write(response.audio_content)
        print(f'Audio content written to file: {name}')


go_ssml(Path.cwd() / 'demo', """
    <speak>
    Go from <mark name="here"/> here, to <mark name="there"/> there!
    </speak>
    """)

For anyone who uses Node, use `const client = new textToSpeech.v1beta1.TextToSpeechClient();` instead of `const client = new textToSpeech.TextToSpeechClient();` — ATP, May 15 '23 at 18:42

score 3 · Answer 2 · answered Oct 01 '20 at 08:37

Looks like this is supported in Cloud Text-to-Speech API v1beta1: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType

You can use https://texttospeech.googleapis.com/v1beta1/text:synthesize. Set the enableTimePointing field of the request body to a list containing the TimepointType value SSML_MARK. If this field is not set, timepoints are not returned by default.
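
As a rough sketch of what that looks like for the REST endpoint, the snippet below builds a v1beta1 request body with enableTimePointing set and reads the timepoints list out of a response. The helper names build_request_body and extract_timepoints are hypothetical, and the sample response is a hand-written placeholder rather than real API output. Sending the request needs the usual authentication, which is not shown here.

```python
import json

ENDPOINT = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"


def build_request_body(ssml, language_code="en-US"):
    """Build a JSON-serializable body for text:synthesize (hypothetical helper)."""
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
        # Without this field the response carries no timepoints.
        "enableTimePointing": ["SSML_MARK"],
    }


def extract_timepoints(response_body):
    """Return (markName, timeSeconds) pairs from a parsed response body."""
    return [(tp["markName"], tp["timeSeconds"])
            for tp in response_body.get("timepoints", [])]


body = build_request_body(
    '<speak>First, <mark name="a"/> second, <mark name="b"/> third.</speak>')
print(json.dumps(body, indent=1))

# Placeholder response for illustration only; the values are made up.
sample_response = {
    "audioContent": "...",
    "timepoints": [
        {"markName": "a", "timeSeconds": 0.5},
        {"markName": "b", "timeSeconds": 1.25},
    ],
}
print(extract_timepoints(sample_response))
```

The printed body can be POSTed to ENDPOINT with any HTTP client; a response without a timepoints key simply yields an empty list.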

i_am_momo

How do I write this? "TimepointType": "SSML_MARK"? – MAZ Aug 16 '21 at 21:19
