
I'm storing country names in a SQLite database exposed via a cpprest server. These country names are queried by my web application, and the results returned by the server are raw binary strings (octet streams) that embed the name length followed by the actual characters of the name.

I'm reading the country names into a std::string value like so:

country->Label = std::string((const char*)sqlite3_column_text(Query.Statement, 1));

I then copy them into a std::vector<char> buffer, which is sent back via the cpprest API through

Concurrency::streams::bytestream::open_istream<std::vector<char>>(buffer);
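
For reference, the layout I'm aiming for in that buffer is a 4-byte little-endian length followed by the bytes of the name. Here's a minimal client-side sketch that builds an equivalent payload for testing; this is only an illustration, and it assumes the length prefix counts the UTF-8 bytes of the name:

// Build a test payload in the assumed layout:
// [ 4-byte little-endian length ][ UTF-8 bytes of the name ]
function makeTestPayload(name) {
    var nameBytes = new TextEncoder().encode(name);                     // UTF-8 bytes of the name
    var payload = new Uint8Array(4 + nameBytes.length);
    new DataView(payload.buffer).setUint32(0, nameBytes.length, true);  // little-endian length prefix
    payload.set(nameBytes, 4);                                          // name bytes after the prefix
    return payload;
}

// makeTestPayload("Österreich") -> prefix of 11, then 11 bytes ("Ö" takes two bytes in UTF-8)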

When my web application receives the data, I decode it like so:

var data = new Uint8Array(request.response);
var dataView = new DataView(data.buffer);

var nameLength = dataView.getUint32(0, true);

var label = "";

for (var k = 0; k < nameLength; k++)
{
    label += String.fromCharCode(dataView.getUint8(k + 4));
}

For the most part this works fine, up until I encounter a country name that contains non-ASCII characters; then I get this abomination:

(screenshot: the country name renders as garbled characters)

My understanding of UTF-8 is that it stores ASCII characters as single bytes, but spreads non-ASCII characters across multiple bytes.
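
For example, as far as I understand it, "é" is stored as the two bytes 0xC3 0xA9, so feeding those bytes to fromCharCode one at a time produces two unrelated characters instead of reassembling them:

// "é" is a single code point but two bytes in UTF-8: 0xC3 0xA9
var bytes = new Uint8Array([0xC3, 0xA9]);

// Treating each byte as its own code point gives "Ã©" (0xC3 -> "Ã", 0xA9 -> "©")
var wrong = String.fromCharCode(bytes[0]) + String.fromCharCode(bytes[1]);

// A UTF-8 decoder reassembles the two bytes into the single character "é"
var right = new TextDecoder("utf-8").decode(bytes);

console.log(wrong, right); // "Ã©" "é"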

Which part of my application stack needs to be told when and where to combine the multiple bytes for the non-ASCII characters, and how would I go about doing that? My guess is that, since the web application is the one that's showing the text, this is where the change needs to happen, but I'm uncertain of how to do it.

Edit: Just to clarify, I have attempted the provided answers, but they don't seem to work either:

var labelArray = data.subarray(4, 4 + nameLength);                  
var label = new TextDecoder("utf-8").decode(labelArray);

which results in this:

(screenshot: the output is still garbled)
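
In case it's useful, this is the kind of check I can run on the client to see what's actually arriving: log the length prefix and the raw bytes next to the decoded text. Nothing here is specific to my app; it's just a diagnostic sketch:

// Diagnostic: log the length prefix, the raw name bytes in hex, and the
// decoded text, to see where the payload and the decoding disagree.
var data = new Uint8Array(request.response);
var dataView = new DataView(data.buffer);

var nameLength = dataView.getUint32(0, true);
var nameBytes = data.subarray(4, 4 + nameLength);

console.log("length prefix:", nameLength);
console.log("raw bytes:", Array.from(nameBytes, function (b) {
    return b.toString(16).padStart(2, "0");
}).join(" "));
console.log("decoded:", new TextDecoder("utf-8").decode(nameBytes));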

Walter
  • It's probably that `fromCharCode()` in the JavaScript, yeah. You're treating each byte like it was an independent codepoint and not like it's part of a multibyte encoding. There's probably a much simpler approach than all that bytestream stuff if you read the cpprest and JavaScript documentation. – Shawn Feb 07 '20 at 23:02
  • Why won't you interface with JavaScript normally using JSON? – rustyx Feb 08 '20 at 09:30
  • The web app isn't a standard web app; it's interactive, so we're trying to squeeze as much performance out of it as possible, and removing JSON parsing (on both the server and client side) is one of the ways we're doing that. – Walter Feb 08 '20 at 09:52
  • I highly doubt that `new TextDecoder("utf-8").decode()` is going to be faster than letting the JS engine parse JSON natively, which is what it is optimized to do. With gzip content-encoding there would also be little difference in payload size. – rustyx Feb 12 '20 at 17:52

1 Answer

var data = new Uint8Array(request.response);
var string = new TextDecoder("utf-8").decode(data);
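
For completeness, since the payload also carries the 4-byte length prefix, a fuller sketch might look like the following; it assumes the request uses responseType = "arraybuffer" and that the prefix is the byte count of the UTF-8 name, and the URL is just a placeholder:

// Sketch: fetch the raw bytes and decode only the name portion as UTF-8.
var request = new XMLHttpRequest();
request.open("GET", "/countries/1");        // placeholder URL
request.responseType = "arraybuffer";
request.onload = function () {
    var data = new Uint8Array(request.response);
    var byteLength = new DataView(data.buffer).getUint32(0, true);
    var label = new TextDecoder("utf-8").decode(data.subarray(4, 4 + byteLength));
    console.log(label);
};
request.send();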
MaxV
  • Thanks for the answer, but unfortunately it didn't seem to work, I've updated my question to reflect what happens with this solution. – Walter Feb 08 '20 at 08:41