Issues with encoding and decoding UTF-8

Question

I'm trying to receive input text, save it to a file, and call another process that will apply text-to-speech to it. I've been struggling with encoding for days and need some help.

At first, I simply took the input text from the POST request as-is and saved it to a file, but I would get errors like this:

  File "/home/.../merlin/src/run_merlin.py", line 1224, in <module>
    main_function(cfg)
  File "/home/.../merlin/src/run_merlin.py", line 572, in main_function
    label_normaliser.perform_normalisation(in_label_align_file_list, binary_label_file_list, label_type=cfg.label_type)
  File "/home/.../merlin/src/frontend/linguistic_base.py", line 68, in perform_normalisation
    self.extract_linguistic_features(ori_file_list[i], output_file_list[i], label_type)
  File "/home/.../merlin/src/frontend/label_normalisation.py", line 26, in extract_linguistic_features
    A = self.load_labels_with_state_alignment(in_file_name)
  File "/home/.../merlin/src/frontend/label_normalisation.py", line 487, in load_labels_with_state_alignment
    utt_labels = fid.readlines()
  File "/home/.../merlin/.venv/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte

So I did the following:

text = text.encode('utf-8').split()
file = codecs.open(filename, 'w+', 'utf-8')
file.write(text.decode('utf-8'))
file.close()

But the same error keeps happening. I tried simply file.write(text), without decoding, but that gave me the following error:

Can't convert 'bytes' object to str implicitly

If it helps, I'm trying to work with Merlin, but as shown above, the error seems to be thrown by Python's codecs.py when reading the file.
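As a rough diagnostic sketch (not part of the original code, and not a Merlin API): the byte 0xa4 has the bit pattern of a UTF-8 continuation byte, so it can never begin a character, which is what "invalid start byte" reports. The helper name `first_invalid_utf8_offset` below is hypothetical; it only shows how one might check whether some bytes are valid UTF-8 and where decoding first fails.

```python
def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte
        return exc.start
    return None


# Same situation as in the traceback: 0xa4 at position 0
sample = b"\xa4 abc"
print(first_invalid_utf8_offset(sample))   # 0
print(sample.decode("latin-1"))            # ¤ abc (0xa4 is a valid Latin-1 character)

# Properly encoded UTF-8 text passes the check
print(first_invalid_utf8_offset("ä".encode("utf-8")))  # None
```

To apply this to a file, the bytes would be read in binary mode, for example with open(path, 'rb'), where path is whichever file the failing readlines() call is actually opening.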

EDIT: Following Giacomo Catenazzi's suggestion I changed the code to:

text = text  # left as str, not encoded
file = codecs.open(filename, 'w+', 'utf-8')
file.write(text)
file.close()

But the same error happens. I added the full stack trace to the beginning of the question since the problem doesn't seem to be where I thought it was.

`text` is text, so you should not encode it. Remove the `text.encode` and `text.decode`. `open` will do the encoding. — Giacomo Catenazzi, May 14 '18 at 13:02
BTW, on stack traces: you should always include the whole trace, or at minimum the part starting from your own code. The codec tells you that you are using it wrongly, but without the rest of the stack trace it is impossible to tell what you did wrong. — Giacomo Catenazzi, May 14 '18 at 13:05
Thanks so much for trying to help me @GiacomoCatenazzi. I did change the code but the error persists. I also added the full stack trace. — Tuma, May 14 '18 at 13:11
Try ISO 8859-1 encoding. Read: https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte — Wafer, May 14 '18 at 16:04
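
The two suggestions in the comments can be sketched as follows. This is a minimal illustration under assumptions, not a confirmed fix: `save_text` and `read_text_with_fallback` are hypothetical helper names, and the file name is made up. The first helper writes a str and lets open() do the encoding; the second tries UTF-8 and falls back to ISO 8859-1 when the bytes are not valid UTF-8. Since the traceback shows the failing read inside Merlin's label_normalisation.py, the file being read there may not be the one written here.

```python
import os
import tempfile


def save_text(path, text):
    """Write a str to disk; open() performs the UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_with_fallback(path, fallback="iso-8859-1"):
    """Decode a file as UTF-8, falling back to another encoding on failure."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a character, so this cannot fail,
        # but the result is only correct if the file really uses it.
        return raw.decode(fallback)


tmp_dir = tempfile.mkdtemp()

# Round trip: text written as UTF-8 reads back unchanged
utf8_path = os.path.join(tmp_dir, "input_utf8.txt")
save_text(utf8_path, "Olá, mundo")
print(read_text_with_fallback(utf8_path))

# A file that starts with byte 0xa4 is not valid UTF-8, so the fallback is used
latin1_path = os.path.join(tmp_dir, "input_latin1.txt")
with open(latin1_path, "wb") as f:
    f.write(b"\xa4 abc")
print(read_text_with_fallback(latin1_path))
```

The fallback only hides the symptom on the reading side. If the reader is fixed to UTF-8, as appears to be the case in the traceback, the file it reads has to be produced as UTF-8 in the first place.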
