Generate timed-text synchronised with Text-to-Speech word-by-word?

Question

How can I generate timed-text (e.g. for subtitles) synchronised with Text-to-Speech (TTS) word-by-word?

I'd like to do this using high-quality SAPI5 voices (e.g. those available from IVONA here), which I have used on Windows 10.

On Windows we already have some good free TTS programs:

Read4Me - open source
Balabolka - closed source
TTSApp - Microsoft's own very basic GUI, currently available here; it seems to date from 2001.

TTSApp can produce audio files in WAV format. Balabolka creates MP3 files along with synchronised timed-text as LRC files (the format used for karaoke), but only on a line-by-line basis, not word-by-word.
However, both show word-by-word highlighting on screen while they speak aloud in real time.

If I had some TTS/SAPI5 source code I could simply check the clock every time a new word starts to be generated and write the time and that word to a file. Does anyone know of any project that exposes that level of programming - so I might start from there?

UPDATE SEPT 2016

I've since discovered that TTSApp was reimplemented in AutoHotkey by a certain jballi in 2012.

I've adapted that code to append the time in ms to a text file every time the onWord event handler fires. I still need to make two passes:

a rapid automated pass to save the WAV file and
a slow (realtime) pass that creates the timing file.

I am still hoping to find a way to accelerate step 2.
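
For what it's worth, the timing file from the second pass can be turned into word-by-word subtitles with a short script. The following is only a sketch in Python, assuming each line of the log holds the time in ms, a tab, and then the word. That layout, the function names and the 400 ms duration given to the final word are all assumptions for illustration, not something the adapted AutoHotkey script defines.

```python
def format_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def parse_log(text: str) -> list[tuple[int, str]]:
    """Parse lines of the assumed form '<ms><TAB><word>' into (ms, word) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ms, word = line.split("\t", 1)
        entries.append((int(ms), word.strip()))
    return entries


def to_srt(entries: list[tuple[int, str]], last_word_ms: int = 400) -> str:
    """Build one SRT cue per word; each cue ends where the next word starts."""
    cues = []
    for i, (start, word) in enumerate(entries):
        if i + 1 < len(entries):
            end = entries[i + 1][0]
        else:
            end = start + last_word_ms  # assumed duration for the final word
        cues.append(
            f"{i + 1}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{word}\n"
        )
    return "\n".join(cues)
```

Calling to_srt(parse_log(open("timing.txt").read())) would give the subtitle text, where timing.txt is a hypothetical name for the log file.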

BTW The VisualBasic source appears to be archived here.

I was looking around and found [this](http://www.annosoft.com/sapi_lipsync/docs/classsapi__textbased__lipsync_a4.html) which might help. You'll definitely have to call ISpRecoResult::GetResultTimes if you need more accuracy than SPEI_SOUND_START and SPEI_SOUND_END — Lesley Gushurst, Mar 15 '16 at 21:41
Thanks Lesley Gushurst - I'll check out that SAPI 5.1 Lipsync code from Annosoft. — GavinBrelstaff, Mar 16 '16 at 12:57
Now I see - the Lipsync program is solving a subtly different problem. It is producing timed-text yes - but it isn't synthesizing the voice audio at the same time. — GavinBrelstaff, Mar 16 '16 at 13:11
Hmmm... Alright, so if you're already using the Speak call in TTS, have you looked into SPEI_WORD_BOUNDARY at all? — Lesley Gushurst, Mar 16 '16 at 15:00

GavinBrelstaff · Accepted Answer · 2016-09-20T16:54:01.710

It is possible to do all of this offline!

You generate a WAV file using SAPI while specifying DoEvents - documented here.

A binary representation of each event (e.g. phoneme/word/sentence) gets appended to the end of the WAV file. A certain Hans documented the WAV/SAPI format in 2009 here.
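
To see where the extra event data sits in such a file, it can help to list the file's top-level RIFF chunks. The sketch below, in Python, relies only on the generic RIFF layout (a 4-byte chunk id, a 4-byte little-endian size, then the payload padded to an even length). It does not decode the SAPI event records themselves, and whether SAPI stores them as a well-formed chunk is an assumption to check against Hans's notes. The function name is made up for this example.

```python
import struct


def list_riff_chunks(data: bytes) -> list[tuple[str, int, int]]:
    """Return (chunk_id, payload_offset, payload_size) for each top-level chunk."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12  # first chunk header follows 'RIFF', the size field and 'WAVE'
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id.decode("ascii", errors="replace"), pos + 8, size))
        pos += 8 + size + (size & 1)  # payloads are padded to an even length
    return chunks
```

Any chunk listed after the "data" chunk is a candidate for the appended event information.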

This can all be done with a simple modification of jballi's 2012 AutoHotkey version of TTSApp.

Basically, you replace these lines of code in Example1GUI.ahk:

SpFileStream.Open(SaveToFileName,SSFMCreateForWrite,False)

;-- Set the output stream to the file stream
SpVoice.AllowAudioOutputFormatChangesOnNextSet:=False
SpVoice.AudioOutputStream:=SpFileStream

;-- Speak using the given flags
SpVoice.Speak(Text,SpeakFlags)

with the following:

SpFileStream.Open(SaveToFileName,SSFMCreateForWrite,True) ;-- third argument is DoEvents

;-- Set the output stream to the file stream
SpVoice.AllowAudioOutputFormatChangesOnNextSet:=False
SpVoice.AudioOutputStream:=SpFileStream

if not Sink ;-- connect SpVoice events once; handlers are functions prefixed "On"
  {
    ComObjConnect(SpVoice, "On")
    Sink:=True
  }

;-- Speak using the given flags
SpVoice.Speak(Text,SpeakFlags|SVSFlagsAsync|SVSFPurgeBeforeSpeak)
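
Because the events now arrive while the file is being written rather than while audio plays, wall-clock times are no longer useful. The position within the audio stream has to be converted into a time instead. The following is a minimal sketch in Python, assuming the stream position reported with each word event is a byte offset into uncompressed PCM audio. That assumption, and the function name, should be checked against the SAPI documentation.

```python
def stream_offset_to_ms(
    byte_offset: int, sample_rate: int, channels: int, bits_per_sample: int
) -> int:
    """Convert a byte offset in PCM audio to milliseconds (rounded down)."""
    bytes_per_second = sample_rate * channels * (bits_per_sample // 8)
    if bytes_per_second <= 0:
        raise ValueError("invalid audio format")
    return byte_offset * 1000 // bytes_per_second
```

For example, with 22050 Hz, 16-bit mono output there are 44100 bytes per second, so an offset of 22050 bytes corresponds to 500 ms.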
