
I have the following "problem" in my application. I am writing an app in which someone enters text, SAPI TTS converts it to speech, and I then work with the output WAV. What I need is information about the phonemes (where in the output WAV a given phoneme is, how long the voice takes to say it, etc.). I used SpVoice.Phoneme() and added a handler for phonemes, so I can now get the duration and so on. But SpVoice.Phoneme() also has a StreamPosition attribute, and I have no idea what it means.

from MSDN:

StreamPosition
The character position in the output stream at which the phoneme begins.

I don't understand whether they mean the byte position in the output WAV (the byte at which the phoneme starts), a time in milliseconds in the output WAV, or something else.

For example, for text:

This is high. This is low. This is fast. This is slow.

I get these StreamPosition values:

Position:0
Position:120
Position:2562
....
Position:143798
Position:147874
Position:151950

The output WAV file is 5.377098 seconds long, and the last phoneme, "ow", is spoken at roughly 4.734 s. The output WAV file is 237,568 bytes. So the StreamPosition value 147874 is probably not the byte at which the phoneme begins. The same goes for a time in milliseconds: the WAV is only 5.3 s long, but 151950 ms would be 151.950 s, so that is ruled out too.

So what is StreamPosition? (What does its value mean?)

I really need to capture the exact time at which each phoneme begins. I tried it with DateTime.Now.Ticks/10000: when the user clicks the button that starts the TTS, I save this value, and when the handler catches a phoneme, I read the value again and compute currTime-startTime. But this method is not very exact; there is always some deviation. Does SpVoice.Phoneme() offer some way to get the exact time at which a phoneme began? If not, is there a better way to get a more exact time in ms?

Sorry for my English, and many thanks for all answers and advice.

tomdelahaba
  • Give [System.Diagnostics.StopWatch](http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.aspx) a try. – Mark Hall Jan 29 '12 at 02:18
  • I will try that, but I am not sure it will help. There will still be some deviation caused by other processes running on the PC, etc. But maybe it will be better than ticks :) – tomdelahaba Jan 29 '12 at 02:26
  • So the difference between DateTime.Now.Ticks and Stopwatch is somewhere between 1 and 4 ms (and I have some operations between those two lines, so it is probably caused by those commands). – tomdelahaba Jan 29 '12 at 03:08
  • You are pushing the limits of resolution. I believe the closest you will be able to get is within 15 ms. Take a look at this [link](http://stackoverflow.com/questions/3744032/why-are-net-timers-limited-to-15-ms-resolution). I believe I misunderstood you, but I will still leave this comment. – Mark Hall Jan 29 '12 at 03:39
  • Thank you for the link. I will read it several times because there are some things I don't understand. As far as I can see, a timer or stopwatch is a bad solution for my problem (getting the start time of a phoneme). But what else can I do when the SAPI TTS Phoneme() event doesn't provide timing (it only provides duration)? Thanks for the answer, Mr. Hall :) – tomdelahaba Jan 29 '12 at 04:46

2 Answers


OK, I will answer my own question. My bachelor's professor sent me some C++ code he wrote. I have been reading it for the last two days, and now I see how stupid I was.

So here is the answer:

The StreamPosition attribute really is the byte position in the output stream (probably the WAV data).

If you want the position in the output stream as a time, you need to write something like:

(int)StreamPosition/(double)wavFileFormat_samplesPerSec/((double)wavFileFormat_BitsPerSample/8)

This gives the position in seconds for a mono stream; multiply by 1000 to get milliseconds (and for multichannel audio, also divide by the number of channels). So you need to find the format of the output stream (bitsPerSample, samplesPerSec), and then you can compute the timing.

tomdelahaba

1) I am not sure how you save the output to a WAV file, but a file size of 237,568 bytes is larger than expected if the sampling rate is 16 kHz. The expected size of a 5.377098-second, 16-bit mono WAV file is

5.377098 * 16000 * 2 = 172067 bytes + header (44 bytes)

So I think your WAV file contains the phoneme events as well.
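
The size estimate above can be checked with a small C++ sketch. The function names are hypothetical, and the sketch assumes uncompressed PCM with a plain 44-byte header, as in the calculation above.

```cpp
#include <cmath>
#include <cstdint>

// Expected size in bytes of a PCM WAV file of the given duration.
// Assumes uncompressed PCM and a 44-byte header by default.
std::uint64_t expectedWavBytes(double seconds,
                               std::uint32_t samplesPerSec,
                               std::uint16_t bitsPerSample,
                               std::uint16_t channels,
                               std::uint64_t headerBytes = 44) {
    const std::uint64_t bytesPerFrame =
        static_cast<std::uint64_t>(bitsPerSample / 8u) * channels;
    // Round the duration to a whole number of sample frames.
    const std::uint64_t frames =
        static_cast<std::uint64_t>(std::llround(seconds * samplesPerSec));
    return headerBytes + frames * bytesPerFrame;
}

// Byte rate implied by an observed file size and duration,
// again assuming a 44-byte header by default.
double impliedBytesPerSecond(std::uint64_t fileBytes,
                             double seconds,
                             std::uint64_t headerBytes = 44) {
    return static_cast<double>(fileBytes - headerBytes) / seconds;
}
```

With the numbers from the question, `expectedWavBytes(5.377098, 16000, 16, 1)` returns 172112, well below the observed 237,568 bytes. Going the other way, `impliedBytesPerSecond(237568, 5.377098)` is roughly 44,173 bytes per second, which is close to the 44,100 bytes per second of 22.05 kHz 16-bit mono audio. That is only an observation from the arithmetic; the question does not say which format was used.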

2) TTS takes time to generate its output, so you can't time it that way. I suggest:

2.1) Record the phoneme events in the output file, as you may already be doing in 1).

You can also refer to this Windows SDK sample:

C:\Program Files\Microsoft SDKs\Windows\v7.1\Samples\winui\speech\ttsapplication

    if (SUCCEEDED(hr))
    {
        // OriginalFmt.WaveFormatExPtr()->nSamplesPerSec;
        // Bind the stream to a WAV file and ask for all TTS events to be stored with it
        hr = SPBindToFile(m_szWFileName, SPFM_CREATE_ALWAYS, &cpWavStream,
                          &OriginalFmt.FormatId(), OriginalFmt.WaveFormatExPtr(),
                          SPFEI_ALL_TTS_EVENTS);
    }
    if (SUCCEEDED(hr))
    {
        // Set the voice's output to the wav file instead of the speakers
        hr = m_cpVoice->SetOutput(cpWavStream, TRUE);
    }

2.2) Time against another event, such as stream start (I am not sure of the exact name).

From the Windows SDK:

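    while (m_cpVoice->GetEvents(1, &event, &ul) == S_OK)
    {
        if (event.eEventId == SPEI_VISEME)
        {
            // For SPEI_VISEME, the low word of lParam is the current viseme
            // and the high word of wParam is its duration in milliseconds.
            printf("v: %i\'", (int)LOWORD(event.lParam)); // viseme
            printf("t: %i\'", (int)HIWORD(event.wParam)); // duration of viseme
        }
        else if (event.eEventId == SPEI_END_INPUT_STREAM)
        {
            // the input stream has finished
        }
        else if (event.eEventId == SPEI_START_INPUT_STREAM)
        {
            // the input stream has started; a possible reference point for timing
        }
    }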

But this code is C++, not C#.

Steven Du
  • Thanks for the answer. I am not sure what you mean by "2.1) record the phoneme event as you may already done in 1". In case it helps: I searched for some advice about C++. There are SPEI_"SOMETHING" events and SPEVENT, which has the attribute ullAudioStreamOffset, and that is exactly what I need. But those events are in C++, and when I look in the Object Browser I don't see SPEVENT in C#. That is probably the problem. If I could get ullAudioStreamOffset, I could compute the position in milliseconds. If you describe what you mean by 2.1, I will try it. Thanks! – tomdelahaba Jan 30 '12 at 03:31