Since there doesn't seem to be a way to go "below" the QWebChannel (without hacking Qt itself) there seemed to be three options:
Create a separate WebSockets connection. That would require bigger non-local changes, and possibly creating extra processes.
Do the byte-to-text decoding on the C++ side, maybe using QTextDecoder. That would probably be simplest. However, there are architectural reasons for preferring to do text decoding close to where we do escape sequence processing. That is certainly more traditional, and may better handle corner cases. Plus DomTerm needs to work with other front-ends than the QtDomTerm front-end.
Encode the bytestream to a sequence of strings, transmit the latter using QWebChannel, convert back to bytes on the JavaScript side, and then handle text-decoding in JavaScript.
I chose to implement the third option. There is some overhead in therms of converting back-end-forth. However, I designed and implemented an encoding that is quite efficient:
/** Encode an arbitrary sequence of bytes as an ASCII string.
* This is used because QWebChannel doesn't have a way to transmit
* data except as strings or JSON-encoded strings.
* We restrict the encoding to ASCII (i.e. codes less then 128)
* to avoid excess bytes if the result is UTF-8-encoded.
*
* The encoding optimizes UTF-8 data, with the following byte values:
* 0-3: 1st byte of a 2-byte sequence encoding an arbitrary 8-bit byte.
* 4-7: 1st byte of a 2-byte sequence encoding a 2-byte UTF8 Latin-1 character.
* 8-13: mean the same ASCII control character
* 14: special case for ESC
* 15: followed by 2 more bytes encodes a 2-byte UTF8 sequence.
* bytes 16-31: 1st byte of a 3-byte sequence encoding a 3-byte UTF8 sequence.
* 32-127: mean the same ASCII printable character
* The only times we generate extra bytes for a valid UTF8 sequence
* if for code-points 0-7, 14-26, 28-31, 0x100-0x7ff.
* A byte that is not part of a valid UTF9 sequence may need 2 bytes.
* (A character whose encoding is partial, may also need extra bytes.)
*/
static QString encodeAsAscii(const char * buf, int len)
{
QString str;
const unsigned char *ptr = (const unsigned char *) buf;
const unsigned char *end = ptr + len;
while (ptr < end) {
unsigned char ch = *ptr++;
if (ch >= 32 || (ch >= 8 && ch <= 13)) {
// Characters in the printable ascii range plus "standard C"
// control characters are encoded as-is
str.append(QChar(ch));
} else if (ch == 27) {
// Special case for ESC, encoded as '\016'
str.append(QChar(14));
} else if ((ch & 0xD0) == 0xC0 && end - ptr >= 1
&& (ptr[0] & 0xC0) == 0x80) {
// Optimization of 2-byte UTF-8 sequence
if ((ch & 0x1C) == 0) {
// If Latin-1 encode 110000aa,10bbbbbb as 1aa,0BBBBBBB
// where BBBBBBB=48+bbbbbb
str.append(4 + QChar(ch & 3));
} else {
// Else encode 110aaaaa,10bbbbbb as '\017',00AAAAA,0BBBBBBB
// where AAAAAA=48+aaaaa;BBBBBBB=48+bbbbbb
str.append(QChar(15));
str.append(QChar(48 + (ch & 0x3F)));
}
str.append(QChar(48 + (*ptr++ & 0x3F)));
} else if ((ch & 0xF0) == 0xE0 && end - ptr >= 2
&& (ptr[0] & 0xC0) == 0x80 && (ptr[1] & 0xC0) == 0x80) {
// Optimization of 3-byte UTF-8 sequence
// encode 1110aaaa,10bbbbbb,10cccccc as AAAA,0BBBBBBB,0CCCCCCC
// where AAAA=16+aaaa;BBBBBBB=48+bbbbbb;CCCCCCC=48+cccccc
str.append(QChar(16 + (ch & 0xF)));
str.append(QChar(48 + (*ptr++ & 0x3F)));
str.append(QChar(48 + (*ptr++ & 0x3F)));
} else {
// The fall-back case - use 2 bytes for 1:
// encode aabbbbbb as 000000aa,0BBBBBBB, where BBBBBBB=48+bbbbbb
str.append(QChar((ch >> 6) & 3));
str.append(QChar(48 + (ch & 0x3F)));
}
}
return str;
}
Yes, this is overkill and almost certainly premature optimization, but I can't help myself :-)