21

I'm generating speech through Google Cloud's text-to-speech API and I'd like to highlight words as they are spoken.

Is there a way of getting timestamps for spoken words or sentences?

user2248702
  • I don't think you can do this with Google Cloud yet, but if you're using an Android device and the Google TextToSpeech engine, you can do this: https://stackoverflow.com/questions/59488998/highlighting-the-text-while-speech-is-progressing – Nerdy Bunz Jan 06 '20 at 22:23
  • You can break the sentences into words as tokens and highlight the words with your own code. You also have to configure the settings properly, and may have to use threads to send multiple words at the same time. Can you please share the code? – Akash Badam Jan 10 '20 at 14:01

3 Answers

6

You can do this using SSML and the v1beta1 version of Google Cloud's text-to-speech API: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType

  1. Add <mark> SSML tags at the points in the text that you want timestamps for (for example, at the end of each sentence), as sketched below.
  2. Set TimepointType to SSML_MARK. If this field is not set, no timepoints are returned.
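
For example, the SSML for a short passage with a mark at the end of each sentence might look like this (a minimal sketch; the mark names s1 and s2 are arbitrary):

    ssml = """
    <speak>
      Here is the first sentence.<mark name="s1"/>
      And here is the second one.<mark name="s2"/>
    </speak>
    """

Each returned timepoint pairs a mark name with its time offset in seconds, so you can map it back to the sentence that precedes it.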
i_am_momo
  • I wonder, would this work if one wanted to offer word highlighting? Is it practical to put marks by every word? – Bret Sep 02 '21 at 19:12
3

At the time of writing, Google's text-to-speech API supports this in the v1beta1 release.

In Python (as an example) you will need to change the import from:

from google.cloud import texttospeech as tts

to:

from google.cloud import texttospeech_v1beta1 as tts

You must use SSML, not plain text, and include <mark> tags in the XML.

The synthesis request needs the enable_time_pointing flag to be set. In Python this looks like:

    response = client.synthesize_speech(
        request=tts.SynthesizeSpeechRequest(
            ...
            enable_time_pointing=[
                tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK]
        )
    )

For a runnable example, see my answer on this question.
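
Putting it together, a minimal end-to-end sketch might look like the following (the en-US voice and MP3 encoding are arbitrary choices, not requirements):

    from google.cloud import texttospeech_v1beta1 as tts

    client = tts.TextToSpeechClient()

    # One <mark> per word, so each word gets its own timepoint.
    ssml = '<speak>Hello<mark name="w0"/> world<mark name="w1"/>.</speak>'

    response = client.synthesize_speech(
        request=tts.SynthesizeSpeechRequest(
            input=tts.SynthesisInput(ssml=ssml),
            voice=tts.VoiceSelectionParams(language_code="en-US"),
            audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
            enable_time_pointing=[
                tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK
            ],
        )
    )

    # response.timepoints holds one entry per <mark>, in order.
    for timepoint in response.timepoints:
        print(timepoint.mark_name, timepoint.time_seconds)

Each entry in response.timepoints carries a mark_name and a time_seconds offset into the generated audio.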

Andrew E
2

This question seems to have gotten quite popular, so I thought I'd share what I ended up doing. This method will probably only work with English or similar languages.

I first split the text on any punctuation that causes a break in speaking, and convert each "sentence" to speech separately. The resulting audio files have a seemingly random amount of silence at the end, which needs to be removed before joining them; this can be done with FFmpeg's silencedetect filter. You can then join the audio files with an appropriate gap between them. Approximate word timestamps can be linearly interpolated within each sentence, as sketched below.
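
For the interpolation step, here is a rough sketch, assuming each sentence's start time and audible duration in the joined audio are already known (the proportional-by-character heuristic is just one option, not the exact code I used):

    def interpolate_word_times(sentence, start_s, duration_s):
        """Estimate each word's start time from its character offset.

        A crude heuristic: assumes speaking time is roughly proportional
        to character position within the sentence.
        """
        times = []
        offset = 0
        for word in sentence.split():
            offset = sentence.index(word, offset)
            times.append((word, start_s + duration_s * offset / len(sentence)))
            offset += len(word)
        return times

    # e.g. a sentence that starts at 1.5 s and lasts 2.0 s
    print(interpolate_word_times("highlight words as they are spoken", 1.5, 2.0))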

user2248702