Python: Using SSML with SAPI (comtypes)

Question

TL;DR: I'm trying to pass an XML object (built with ET) to a comtypes (SAPI) object in Python 3.7.2 on Windows 10. It's failing, apparently due to invalid characters (see error below). Unicode characters are read correctly from the file and can be printed, but they do not display correctly on the console. It seems like the XML is being passed as ASCII, or that I'm missing a flag (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? I haven't figured that part out yet.

Long form description

I'm using Python 3.7.2 on Windows 10 and trying to create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) document to use with Microsoft's speech API. The voice struggles with certain words, and when I looked at the SSML format I saw that it supports a phoneme tag, which lets you specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call SAPI, with parts of the code replaced, I get the following error:

Traceback (most recent call last):
  File "pdf_to_speech.py", line 132, in <module>
    audioConverter(text = "Hello world extended test",outputFile = output_file)
  File "pdf_to_speech.py", line 88, in __call__
    self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))

I've been trying to debug, but when I print the pronunciations of the words, the characters show up as boxes. However, if I copy and paste them from my console, they look fine (see below).

həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst

Best Guess

I'm unsure whether the problem is related to any of the changes I've made so far: 1) I changed Python versions to be able to print Unicode, 2) I fixed problems with reading the file, 3) I had incorrect manipulations of the string.

I'm pretty sure the problem is that I'm not passing it as Unicode to the comtypes object. The ideas I'm looking into are: 1) Is there a flag missing? 2) Is it being converted to ASCII when it's passed to comtypes (a C types error)? 3) Is the XML being passed incorrectly, or am I missing a step?

Sneak peek at the code

This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.

class SSML_Generator:
    def __init__(self,pause,phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile,str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}

    def _load_phonemes(self, phonemeFile):
        phonemeDict = {}
        with io.open(phonemeFile, 'r',encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict

    def __call__(self,text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance,SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document

    def _pause(self,parent_tag):
        return  ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})

    def _header(self):
        return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})

    # TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass

    # TODO: Add pitch 
    def _pitch(self):
        pass

    def _pronounce(self,word,parent_tag):
        if word in self.phonemeDict:
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
        else:
            return parent_tag
    # Nice to have: Transform acronyms into their pronunciation (See say as tag)

I've also added the code that writes to the comtypes object (SAPI), in case the error is there.

def __call__(self, text, outputFile):
    # https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text, encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()

Thanks in advance for your help!

The answer to this post is exactly what I want to do. https://stackoverflow.com/questions/31167967/python-3-4-text-to-speech-with-sapi/31172101#31172101. However I have no way of commenting and the answer is not included. — Felix Labelle, Apr 10 '19 at 23:46

Ivan Muravlev · Accepted Answer · 2019-12-17T15:01:25.517

Try using single quotes to delimit the ph attribute value, so that the value itself can contain a double quote. Like this:

my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'

Also remember to use \ to escape the single quotes when the Python string itself is delimited by single quotes.
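An alternative to hand-written quoting is to let ElementTree build the element, since it escapes attribute values on serialization. This is only a minimal sketch: the helper name phoneme_element is hypothetical, and it assumes the SAPI XML parser decodes the standard &quot; entity, which is not verified here.

```python
import xml.etree.ElementTree as ET


def phoneme_element(word, ph, alphabet="ipa"):
    """Build <phoneme alphabet=... ph=...>word</phoneme> (hypothetical helper)."""
    elem = ET.Element("phoneme", {"alphabet": alphabet, "ph": ph})
    elem.text = word  # the text the pronunciation applies to
    return elem


speak = ET.Element("speak")
speak.append(phoneme_element("ведение", 'v"e.de.ni.e', alphabet="x-sampa"))

# encoding="unicode" returns a str; ElementTree escapes the double quote
# inside the attribute value instead of producing malformed XML.
ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```

Parsing the resulting string back with ET.fromstring returns the original ph value, which is a quick way to confirm that the markup is well-formed before it is handed to the engine.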

UPD: This error can also mean that your ph value cannot be parsed. You can check the docs here: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup

This example will work:

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice  name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
  </voice>
</speak>

but this one doesn't:

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice  name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
  </voice>
</speak>
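One way to catch an unparseable ph value before calling the engine is to check each symbol against the set the voice accepts. The following is a minimal sketch, assuming space-separated UPS symbols. KNOWN_UPS is a hypothetical, deliberately tiny stand-in that contains only the two symbols from the working example above, not the real UPS inventory.

```python
# Hypothetical stand-in: only the symbols from the working example.
# Replace with the full symbol set documented for your voice.
KNOWN_UPS = {"JH", "AU"}


def invalid_symbols(ph, known=KNOWN_UPS):
    """Return the space-separated symbols in ph that are not in the known set."""
    return [symbol for symbol in ph.split() if symbol not in known]


print(invalid_symbols("JH AU"))    # value from the working example
print(invalid_symbols("JHU AUA"))  # value from the failing example
```

An empty list means every symbol was recognized. Anything else can be logged or skipped instead of triggering the COM error at speak time.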


Thanks! I haven't looked at this code in a while, but I'll try your suggestions tonight. – Felix Labelle Dec 18 '19 at 19:48
I think I was using an Arpabet based corpus.. I'm trying to find a large enough IPA based corpus just to be sure that this is the issue – Felix Labelle Dec 20 '19 at 02:39
The error was really silly, there was a trailing comma in certain phonemes. The file was being read incorrectly. – Felix Labelle Dec 20 '19 at 03:22
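The trailing-comma problem from the last comment can be sketched roughly as follows. This assumes a dictionary file in which each line holds a word followed by one or more comma-separated pronunciations. That layout is a guess based on the printed output in the question, and load_phonemes is a hypothetical stand-in for the original _load_phonemes.

```python
import io


def load_phonemes(lines):
    """Map each lowercased word to its first pronunciation, without stray commas."""
    phoneme_dict = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        # Whitespace splitting leaves the separator attached to the first
        # pronunciation ("həˈloʊ,"), so strip it before storing the value.
        phoneme_dict[tokens[0].lower()] = tokens[1].rstrip(",")
    return phoneme_dict


# Assumed file layout, for illustration only.
sample = io.StringIO("hello\thəˈloʊ, hɛˈloʊ\nworld\tˈwɝːld\n")
phonemes = load_phonemes(sample)
print(phonemes)
```

The function accepts any iterable of lines, so the same code works with a file opened via io.open(path, 'r', encoding='utf-8') as in the question.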
