I'm trying to receive input text, save it to a file, and call another process that applies text-to-speech to it. I've been struggling with encoding issues for days and need some help.
At first, I simply took the input text from the POST request as-is and saved it to a file, but I would get errors like this:
File "/home/.../merlin/src/run_merlin.py", line 1224, in <module>
main_function(cfg)
File "/home/.../merlin/src/run_merlin.py", line 572, in main_function
label_normaliser.perform_normalisation(in_label_align_file_list, binary_label_file_list, label_type=cfg.label_type)
File "/home/.../merlin/src/frontend/linguistic_base.py", line 68, in perform_normalisation
self.extract_linguistic_features(ori_file_list[i], output_file_list[i], label_type)
File "/home/.../merlin/src/frontend/label_normalisation.py", line 26, in extract_linguistic_features
A = self.load_labels_with_state_alignment(in_file_name)
File "/home/.../merlin/src/frontend/label_normalisation.py", line 487, in load_labels_with_state_alignment
utt_labels = fid.readlines()
File "/home/.../merlin/.venv/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte
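For context, this error is easy to reproduce outside Merlin. A minimal hypothetical reproduction: byte 0xa4 is not a valid UTF-8 start byte, so decoding it fails exactly as in the traceback. In Latin-1 that byte is the currency sign '¤', which hints that whatever file is being read was not written as UTF-8 in the first place.

```python
# Decoding a lone 0xa4 byte as UTF-8 fails with the same reason as the
# traceback above ("invalid start byte").
try:
    b"\xa4".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)
```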
So I did the following:
import codecs

text = text.encode('utf-8')
file = codecs.open(filename, 'w+', 'utf-8')
file.write(text.decode('utf-8'))
file.close()
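For what it's worth, the encode/decode round trip in the snippet above is a no-op, so the writing side should already be producing a valid UTF-8 file. A minimal self-contained check, using a hypothetical sample string and a temporary file:

```python
import codecs
import os
import tempfile

# Encoding then immediately decoding is an identity on any str, and a file
# written through codecs.open(..., 'utf-8') round-trips cleanly.
text = "grüße ¤"
assert text.encode("utf-8").decode("utf-8") == text

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w+", "utf-8") as f:
    f.write(text)
with codecs.open(path, "r", "utf-8") as f:
    round_tripped = f.read()
print(round_tripped == text)
```

If this check passes but Merlin still fails, the file Merlin reads is probably not the one being written here.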
But the same error keeps happening. I also tried file.write(text) without decoding, but that raised:
TypeError: Can't convert 'bytes' object to str implicitly
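That second error is expected: a codecs writer opened in text mode accepts str, not bytes. A minimal sketch with a hypothetical temporary file (the exact error message varies across Python versions; the one quoted above is Python 3.5's):

```python
import codecs
import os
import tempfile

# Handing bytes to a text-mode codecs writer raises TypeError; the writer
# wants str and encodes it to bytes itself.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = codecs.open(path, "w", "utf-8")
try:
    f.write("text".encode("utf-8"))  # bytes passed where str is expected
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__
finally:
    f.close()
print(error_type)
```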
If it helps, I'm working with Merlin, but as shown above, the error seems to be thrown by Python's codecs.py when reading the file.
EDIT: Following Giacomo Catenazzi's suggestion, I changed the code to:

import codecs

text = text  # left as-is, not encoded
file = codecs.open(filename, 'w+', 'utf-8')
file.write(text)
file.close()
But the same error happens. I've added the full stack trace to the beginning of the question, since the problem doesn't seem to be where I thought it was.
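Since the traceback shows the failure inside Merlin's label reading, it may be worth inspecting the raw bytes of the file Merlin actually opens. A diagnostic sketch with simulated data (the 0xa4 byte here stands in for whatever is really in the label file; substitute the actual path from the traceback):

```python
import os
import tempfile

# Simulate a label file whose first byte is 0xa4, matching the traceback.
# With the real path, this shows whether Merlin is being pointed at a
# binary label file rather than the UTF-8 text file that was written.
path = os.path.join(tempfile.mkdtemp(), "sample.lab")
with open(path, "wb") as f:
    f.write(b"\xa4\x00 some non-text payload")

with open(path, "rb") as f:
    head = f.read(16)
print(head)

try:
    head.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

If the first bytes look like binary data rather than text, the problem is which file Merlin reads, not how the text was encoded.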