TL;DR: I'm trying to pass an XML object (using ET) to a Comtypes (SAPI) object in python 3.7.2 on Windows 10. It's failing due to invalid chars (see error below). Unicode characters are read correctly from the file, can be printed (but do not display correctly on the console). It seems like the XML is being passed as ASCII or that I'm missing a flag? (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet..)
Long form description
I'm using Python 3.7.2 on Windows 10 and trying to send create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) file to use with Microsoft's speech API. The voice struggles with certain words and when I looked at the SSML format and it supports a phoneme tag, which allows you to specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element) so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call the SAPI, with parts of the code replaced I get the following error:
Traceback (most recent call last):
File "pdf_to_speech.py", line 132, in <module>
audioConverter(text = "Hello world extended test",outputFile = output_file)
File "pdf_to_speech.py", line 88, in __call__
self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
I've been trying to debug, but when I print the pronunciations of the words the characters are boxes. However if I copy and paste them from my console, they look fine (see below).
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
Best Guess
I'm unsure whether the problem is caused by 1) I've changed versions of pythons to be able to print unicode 2) I fixed problems with reading the file 3) I had incorrect manipulations of the string
I'm pretty sure the problem is that I'm not passing it as a unicode to the comtype object. The ideas I'm looking into are 1) Is there a flag missing? 2) Is it being converted to ascii when its being passed to comtypes (C types error)? 3) Is the XML being passed incorrectly/ am I missing a step?
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.
class SSML_Generator:
def __init__(self,pause,phonemeFile):
self.pause = pause
if isinstance(phonemeFile,str):
print("Loading dictionary")
self.phonemeDict = self._load_phonemes(phonemeFile)
print(len(self.phonemeDict))
else:
self.phonemeDict = {}
def _load_phonemes(self, phonemeFile):
phonemeDict = {}
with io.open(phonemeFile, 'r',encoding='utf-8') as f:
for line in f:
tok = line.split()
#print(len(tok))
phonemeDict[tok[0].lower()] = tok[1].lower()
return phonemeDict
def __call__(self,text):
SSML_document = self._header()
for utterance in text:
parent_tag = self._pronounce(utterance,SSML_document)
#parent_tag.tail = self._pause(parent_tag)
SSML_document.append(parent_tag)
ET.dump(SSML_document)
return SSML_document
def _pause(self,parent_tag):
return ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})
def _header(self):
return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})
# TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
def _rate(self):
pass
# TODO: Add pitch
def _pitch(self):
pass
def _pronounce(self,word,parent_tag):
if word in self.phonemeDict:
sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
else:
return parent_tag
# Nice to have: Transform acronyms into their pronunciation (See say as tag)
I've also added how the code writes to the comtype object (SAPI) in case the error is there.
def __call__(self,text,outputFile):
# https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
self.engine.AudioOutputStream = self.stream
text = self._text_processing(text)
text = self.SSML_generator(text)
text = ET.tostring(text,encoding='utf8', method='xml').decode('utf-8')
self.engine.speak(text)
self.stream.Close()
Thanks in advance for your help!