How to convert only the next one character from a UTF-8 byte array efficiently?

Question

I have this code which works:

    QString qs = QString::fromUtf8(bp,ut).at(0);
    QChar c(qs[0]);

Where bp is a QByteArray::const_pointer, and ut is the maximum expected length of the UTF-8 encoded Unicode code-point. I then grab the first QChar c from the QString qs. It seems that there should be a more efficient way to simply get only the next QChar from the UTF-8 byte array without having to convert an arbitrary amount of the QByteArray into a QString and then getting only the first QChar.

EDIT: From the comments below, it is clear that no one yet understands my question, so I will start with some basics. UTF-8 and UTF-16 are two different encodings of the Unicode standard. The most common and encouraged encoding for transfer over the Internet and for Unicode text files is UTF-8, in which every Unicode code-point takes 1 to 4 bytes. UTF-16, on the other hand, is more convenient for handling characters inside a program. Therefore the vast majority of software converts between these two encodings all the time. A QChar holds the UTF-16 encoding of the Unicode code-points from 0x00 to 0xffff, which covers the majority of the languages and symbols so far defined and in common use. Surrogate pairs are used for the higher code-point values. At present surrogate pairs seem to have limited support, and they are not of interest to me for the present question.

When you read a text file into a QPlainTextEdit, the conversion is done automatically behind the scenes. Converting a QByteArray to a QString can also be done implicitly (provided your locale and codec settings are set for UTF-8), or explicitly using toUtf8() or fromUtf8() as in my code above.

The conversion in the other direction can efficiently be done implicitly (behind the scenes) or explicitly with the following code:

    ba += *si; // Depends on the UTF-8 codec

or

    ba += QString(*si).toUtf8(); // UTF-8 explicitly

where ba is a QByteArray and si is a QString::const_iterator. These do exactly the same thing (assuming the codec is set to UTF-8): both convert the one QChar that si points to within a QString and append the resulting one or more bytes to ba.
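For reference, the one-character conversion in this direction can be sketched in plain C++ without Qt. This is only an illustration of the UTF-8 bit layout, not what Qt does internally; `Utf8Bytes` and `encodeUtf8` are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: up to 4 UTF-8 bytes plus how many are used.
// length == 0 means the code point cannot be encoded.
struct Utf8Bytes {
    unsigned char bytes[4];
    std::size_t length;
};

// Hypothetical helper (not a Qt function): encodes one Unicode code point.
constexpr Utf8Bytes encodeUtf8(std::uint32_t cp)
{
    if (cp < 0x80)                       // 1 byte: 0xxxxxxx
        return {{static_cast<unsigned char>(cp), 0, 0, 0}, 1};
    if (cp < 0x800)                      // 2 bytes: 110xxxxx 10xxxxxx
        return {{static_cast<unsigned char>(0xC0 | (cp >> 6)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0, 0}, 2};
    if (cp >= 0xD800 && cp <= 0xDFFF)    // lone surrogates are not encodable
        return {{0, 0, 0, 0}, 0};
    if (cp < 0x10000)                    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return {{static_cast<unsigned char>(0xE0 | (cp >> 12)),
                 static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0}, 3};
    if (cp <= 0x10FFFF)                  // 4 bytes: 11110xxx then three 10xxxxxx
        return {{static_cast<unsigned char>(0xF0 | (cp >> 18)),
                 static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)),
                 static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F))}, 4};
    return {{0, 0, 0, 0}, 0};            // beyond U+10FFFF
}
```

Every code point a single QChar can hold (up to 0xffff) takes at most 3 bytes here; the 4-byte branch is only reached for code points that UTF-16 represents as a surrogate pair.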

All I am trying to do is the inverse conversion for only one character at a time, efficiently. Internally this is being done for every character being converted, and I'm sure it is being done very efficiently.

The problem with QString::fromUtf8(p,n) is that n is the number of bytes to process rather than the number of characters to convert. Therefore you must allow for the largest number of bytes which could be 3 (or 4 if it actually handled surrogate pairs). So if all you want is the next character, you must be prepared to process several bytes, and they do get converted and then are discarded if the result is a QString with more than one character.

Q: Is there a conversion function that does this one character at a time?

A UTF-8 character declares its byte length in its first byte. 0-127 means ASCII, so it's just that one byte. 194-223 means 2 bytes, 224-239 means 3 and 240-244 means 4. Other values are invalid for the first byte of a UTF-8 character, and the following bytes must be 128-191. A QChar may store one character as a code point, which is simply the byte sequence as a 4-byte integer (unused bytes are zero). — Youka, Feb 09 '16 at 16:38
Trying to save a UTF8 char in a QChar sounds like a recipe for disaster. What are you trying to accomplish here exactly? — MrEricSir, Feb 09 '16 at 16:59
@Harvey Well, when you "know all that, and more" it's obvious that you can't get a character without evaluating the previous ones, and Qt has a high-level API, so converting to a QString and keeping it for further extractions is the best solution. I didn't find a UTF-8 iterator over a byte sequence. — Youka, Feb 09 '16 at 17:58
@Harvey I'm not sure where you got the idea that QChar would be an appropriate container for storing UTF-8 formatted characters. QChar is designed to hold a UTF-16 character -- if you're using it for anything else you're almost guaranteed to wind up with some serious bugs. — MrEricSir, Feb 09 '16 at 18:12
@Harvey It converts data from UTF-8 to UTF-16 as the name implies (and as it states in the docs.) — MrEricSir, Feb 10 '16 at 18:23
Such UTF-16-code-unit-at-a-time iteration is not generally possible anyway, as there's no 1:N correspondence between UTF-8 and UTF-16. — n. m. could be an AI, Feb 11 '16 at 16:35
[XY problem](https://www.google.com/search?q=XY+problem). QChar is not a "Unicode character", it is an UTF-16 code unit. An UTF-16 code unit on its own makes no more sense than an UTF-8 code unit (i.e. a byte). — n. m. could be an AI, Feb 12 '16 at 16:57
@n.m. Thank you for informing me of the "XY problem." My original title was indeed poorly worded, and I should have left the result in QString. I would change the post, but I've already edited it too many times. It's a mess. I don't think it is helpful to the wider community. Is there a more efficient way to convert one Unicode code-point per call? — Harvey, Feb 14 '16 at 05:31
I don't think Qt provides a way, because it doesn't seem to have a "Unicode code-point" abstraction at all. Other libraries like UTF8-CPP have this functionality. C++11 has std::codecvt. — n. m. could be an AI, Feb 14 '16 at 05:43
I still wonder why you need to convert one character at a time. — n. m. could be an AI, Feb 14 '16 at 05:51
Suffice it to say that the byte following the currently pointed to Utf-8 encoded character may not be UTF-8 at all. I must stop after exactly one UTF-8 encoded character. If the current byte pointer is not pointing to the lead byte of a valid UTF-8 character, I would prefer to be notified that it is not valid UTF-8, rather than getting the "replacement character". I only want to ever deal with one Unicode character in the QString at a time. All that I am wanting is already happening under the fromUtf8() function anyway in a loop, one character at a time. Please just believe me. — Harvey, Feb 14 '16 at 07:00
@n.m. I added the above comment after seeing your "I still wonder why..." Then I noticed your comment just above that. Yes, that is what I have been painfully coming to believe. Thank you. — Harvey, Feb 14 '16 at 07:06
Hmm. I see Qt is giving you replacement characters instead of errors. In my view this makes this API utterly useless for any purpose, not just your specific purpose. Use UTF8-CPP or std::codecvt or whatever UTF-8 library is not brain damaged. I still don't understand why you need to convert one character at a time. Convert the whole bunch, stop if there's an error, then iterate over the converted string one character at a time. — n. m. could be an AI, Feb 14 '16 at 07:40
@n.m. Finally someone starting to understand my frustration. Trust me, my program works quite well, but there is some messy workaround code in it that is quite inefficient, making it a little slow. I would be more willing to give the why in a private chat. (stackoverflow is recommending that - I've not done that before. Can you start it and make it private?) — Harvey, Feb 14 '16 at 08:46
I have created the chat but I cannot make it private. http://chat.stackoverflow.com/rooms/103413/room-for-n-m-and-harvey — n. m. could be an AI, Feb 14 '16 at 10:05
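The lead-byte ranges quoted in the first comment, together with the requirement to stop after exactly one character and to report invalid input instead of returning a replacement character, can be sketched in plain C++ as follows. This is a minimal sketch, not a Qt API: `DecodedChar` and `decodeOneUtf8` are hypothetical names, and the caller is assumed to know how many bytes remain in the buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type. length is the number of bytes consumed;
// length == 0 signals invalid or truncated input (no replacement character).
struct DecodedChar {
    std::uint32_t codePoint;
    std::size_t length;
};

// Hypothetical helper (not a Qt function): decodes exactly one code point
// starting at p, looking at no more than avail bytes.
constexpr DecodedChar decodeOneUtf8(const char* p, std::size_t avail)
{
    const DecodedChar invalid{0, 0};
    if (avail == 0) return invalid;

    const unsigned char b0 = static_cast<unsigned char>(p[0]);
    if (b0 < 0x80) return {b0, 1};                 // ASCII

    std::size_t len = 0;
    std::uint32_t cp = 0;
    std::uint32_t min = 0;                         // smallest value allowed for this length
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return invalid;                           // continuation byte or illegal lead byte

    if (avail < len) return invalid;               // sequence is cut off
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(p[i]);
        if ((b & 0xC0) != 0x80) return invalid;    // must be 128-191
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min) return invalid;                      // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return invalid;  // surrogate range
    if (cp > 0x10FFFF) return invalid;                 // beyond Unicode
    return {cp, len};
}
```

Bytes after the decoded sequence are never read, so it does not matter whether they are UTF-8 at all. In a Qt program, a result of 0xffff or below fits in one QChar, and QString::fromUcs4() can build a QString from a higher code point.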

score 1 · Answer 1 · answered Feb 11 '16 at 17:48

You want to use QTextDecoder.

It is, according to the documentation:

The QTextDecoder class provides a state-based decoder. A text decoder converts text from an encoded text format into Unicode using a specific codec. The decoder converts text in this format into Unicode, remembering any state that is required between calls.

The important thing here is state. QString and QTextCodec are stateless, so they work on entire strings, start to end.

QTextDecoder, on the other hand, allows you to work on byte buffers one byte at a time, maintaining a state between calls so the caller knows if a UTF-8 sequence has been only partially decoded.

For example:

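    // Needs <QTextCodec> and <QTextDecoder> (in Qt 6 these are in the Core5Compat module).
    QTextDecoder decoder(QTextCodec::codecForName("UTF-8"));
    QString result;
    // Feed the decoder one byte per call; it buffers incomplete sequences.
    for (int i = 0; i < bytearray.size(); i++) {
        result = decoder.toUnicode(bytearray.constData() + i, 1);
        if (!result.isEmpty()) {
            break; // we got our character!
        }
    }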

The rationale behind this loop is that as long as the decoder is not able to decode a complete UTF-8 character, it will return an empty string.

As soon as it is able to, the result string will contain the one decoded unicode character.

This loop stops as soon as one character is complete, and by remembering the loop index, the next characters can be obtained in the same way.
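To make "remembering state between calls" concrete, here is a plain C++ sketch of a decoder that is fed one byte at a time and mirrors the loop above. It is an illustration only and not QTextDecoder's implementation; `Utf8Stepper`, `FirstChar` and `decodeFirst` are hypothetical names. Unlike the Qt loop, it reports invalid input explicitly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stateful decoder: keeps the partial code point between calls.
struct Utf8Stepper {
    enum class Status { NeedMore, Done, Invalid };

    std::uint32_t codePoint = 0;  // complete once feed() returns Done
    std::uint32_t minimum = 0;    // smallest value allowed for the current length
    unsigned remaining = 0;       // continuation bytes still expected

    constexpr Status feed(unsigned char b)
    {
        if (remaining == 0) {     // expecting a lead byte
            if (b < 0x80) { codePoint = b; return Status::Done; }
            if (b >= 0xC2 && b <= 0xDF)      { codePoint = b & 0x1F; minimum = 0x80;    remaining = 1; }
            else if (b >= 0xE0 && b <= 0xEF) { codePoint = b & 0x0F; minimum = 0x800;   remaining = 2; }
            else if (b >= 0xF0 && b <= 0xF4) { codePoint = b & 0x07; minimum = 0x10000; remaining = 3; }
            else return Status::Invalid;
            return Status::NeedMore;
        }
        if ((b & 0xC0) != 0x80) { remaining = 0; return Status::Invalid; }
        codePoint = (codePoint << 6) | (b & 0x3F);
        if (--remaining != 0) return Status::NeedMore;
        const bool ok = codePoint >= minimum && codePoint <= 0x10FFFF
                        && !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
        return ok ? Status::Done : Status::Invalid;
    }
};

// consumed == 0 means the input was invalid or ended mid-sequence.
struct FirstChar {
    std::uint32_t codePoint;
    std::size_t consumed;
};

// Same shape as the loop above: one byte per step, stop at the first result.
constexpr FirstChar decodeFirst(const char* data, std::size_t size)
{
    Utf8Stepper stepper{};
    for (std::size_t i = 0; i < size; ++i) {
        switch (stepper.feed(static_cast<unsigned char>(data[i]))) {
        case Utf8Stepper::Status::Done:     return {stepper.codePoint, i + 1};
        case Utf8Stepper::Status::Invalid:  return {0, 0};
        case Utf8Stepper::Status::NeedMore: break;
        }
    }
    return {0, 0};
}
```

The `consumed` value plays the role of the remembered loop index: adding it to the current position gives the start of the next character.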

answered Feb 11 '16 at 17:48

SirDarius


Thank you so much. I will definitely investigate this more. – Harvey Feb 12 '16 at 11:52
I do appreciate getting an answer rather than all the comments from people who it seems didn't read any more than the title. I have tried a bit with the QTextDecoder, but it does not seem to be a good fit for my purpose. I was so frustrated with all the comments and having to edit my post numerous times to try to make it clear what I wanted, that I decided to start another thread as a clean start. But immediately got a -1 and no response of any kind. I consider deleting this entire thread because I don't think it is helpful to the wider community. What do you think? – Harvey Feb 14 '16 at 05:04
You have obviously put a great deal of time and effort in your question, and I have tried to address the main point to the best of my understanding. However, what most people have tried to express in the comments is that the problem you describe seems narrow in scope. What is missing in the question is why you need to iterate on UTF-8 characters one byte at a time. 99.9% of developers only need to convert strings from one format to another, hence the apparent restriction in the QT API. So the edit that needs to be done is the why you need to do that thing that only 0.01% of developers need to. – SirDarius Feb 14 '16 at 15:38
Not "one byte at a time," that's useless for most purposes. I want one Unicode character at a time - what every UTF-8 conversion is actually doing at some level. – Harvey Feb 15 '16 at 09:07
Don't mind the typo, I actually reached maximum comment length, so I couldn't fix it :) – SirDarius Feb 15 '16 at 09:13
