6

I'm writing a custom cross-platform minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)

The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.

But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: The client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it is actual UTF-8 characters, not something else.)

I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from a FILE *.)

U<CRLF>       ; data type marker (actually read by dispatching code)
<SIZE><CRLF>  ; UTF-8 string size in characters
<DATA><CRLF>  ; data blob

Example:

U
7
Юникод!

Update:

One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.

And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator; I don't want to mess with escaping it in the data.

Alexander Gladysh
  • 39,865
  • 32
  • 103
  • 160
  • Here is the code I wrote. Far from "10 minutes to implement"... http://codereview.stackexchange.com/questions/1624/please-review-my-utf-8-character-reader-function – Alexander Gladysh Apr 02 '11 at 14:04

6 Answers

9

It's pretty easy to write a UTF-8 "reader" given the information here; UTF-8 was designed so tasks like this one would be easy.

In essence, you start reading characters until you have read as many as the client tells you. You know that you've read a whole character given the UTF-8 encoding definition, specifically:

If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern.
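
For illustration, here is a minimal C89 sketch of such a reader, assuming that <SIZE> counts codepoints and that the stream is positioned right after the <SIZE><CRLF> line. The function name and error handling are mine, and deeper validation (overlong sequences, surrogate ranges) is deliberately left out:

#include <stdio.h>
#include <stdlib.h>

/* Read exactly `nchars` UTF-8-encoded codepoints from `in` into a
 * malloc'd, NUL-terminated buffer.  Returns NULL on EOF, allocation
 * failure, or a byte that is not a valid leading/continuation byte. */
static char *read_utf8_string(FILE *in, size_t nchars)
{
    size_t cap = nchars * 4 + 1;     /* worst case: 4 bytes per codepoint */
    char *buf = malloc(cap);
    size_t len = 0;
    size_t i;

    if (buf == NULL)
        return NULL;

    for (i = 0; i < nchars; ++i) {
        int c = fgetc(in);
        int extra;                   /* continuation bytes still to read */

        if (c == EOF)
            goto fail;
        if ((c & 0x80) == 0x00)      extra = 0;  /* 0xxxxxxx */
        else if ((c & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3;  /* 11110xxx */
        else
            goto fail;               /* stray continuation or invalid byte */

        buf[len++] = (char)c;
        while (extra-- > 0) {
            c = fgetc(in);
            if (c == EOF || (c & 0xC0) != 0x80)
                goto fail;           /* must be 10xxxxxx */
            buf[len++] = (char)c;
        }
    }
    buf[len] = '\0';
    return buf;

fail:
    free(buf);
    return NULL;
}

The dispatching code could then read and check the trailing <CRLF> of the datagram separately, the same way it already handles the type marker line.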

Jon
  • 428,835
  • 81
  • 738
  • 806
  • Lazy question: is there a code that does this, that I can reuse? (Or at least a set of test data so I would know that I did not screw up the implementation.) :-) – Alexander Gladysh Apr 01 '11 at 18:21
  • @AlexanderGladysh: I just updated with four lines of text which I think you can implement in 10 minutes or so :) – Jon Apr 01 '11 at 18:22
  • @Jon: Thanks! But a test set of data would be nice. I vaguely remember seeing somewhere a text file with a lot of weird UTF-8 stuff for related purposes. Will try to google it up... – Alexander Gladysh Apr 01 '11 at 18:27
  • @Jon: Looking at what other implementations I could google up do to read UTF-8, I'm not sure that it can be done in 10 minutes properly... Well, I can try :-) – Alexander Gladysh Apr 01 '11 at 19:38
  • 2
    This is the file I mentioned: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt – Alexander Gladysh Apr 01 '11 at 20:02
  • 3
    If you roll your own, you need to be very careful not to misinterpret invalid byte sequences as if they were UTF-8, especially if it's important that the length later be valid... – R.. GitHub STOP HELPING ICE Apr 01 '11 at 20:02
  • @R: This is what I'm worried about. So, if there is an implementation I can reuse, it would be great. – Alexander Gladysh Apr 01 '11 at 20:11
2

Well, the length property of JavaScript strings seems to count codepoints, not characters, as you can see (but wait! it's not quite codepoints):

> s1='\u0061\u0301'
'á'
> s2='\u00E1'
'á'
> s1.length
2
> s2.length
1
>

Although that's with V8. Looking around, it seems that's actually what the ECMAScript standard requires:

https://forums.teradata.com/blog/jasonstrimpel/2011/11/javascript-string-length-and-internationalizing-web-applications

Also, checking ECMA-262, on pages 40-41 of the PDF it says "The length of a String is the number of elements (i.e., 16-bit values) within it", and then goes on to make clear that the elements are UTF-16 units. Sadly that's not quite "codepoints". Basically, this makes the string length property rather useless. Looking around I find this:

How can I tell if a string contains multibyte characters in Javascript?
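
Since String.length counts UTF-16 code units, one way for the server to match it is to reproduce that count from the UTF-8 bytes it receives: every 1-, 2- or 3-byte UTF-8 sequence corresponds to one UTF-16 unit, and every 4-byte sequence (a codepoint above U+FFFF) corresponds to a surrogate pair, i.e. two units. A minimal sketch, assuming the buffer is already known to be valid UTF-8 (the function name is made up here):

#include <stddef.h>

/* Count how many UTF-16 code units (what JavaScript's String.length
 * reports) the UTF-8 buffer `buf` of `len` bytes corresponds to. */
static size_t utf16_unit_count(const char *buf, size_t len)
{
    size_t i = 0, units = 0;

    while (i < len) {
        unsigned char b = (unsigned char)buf[i];

        if (b < 0x80)                { i += 1; units += 1; } /* 1-byte sequence */
        else if ((b & 0xE0) == 0xC0) { i += 2; units += 1; } /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) { i += 3; units += 1; } /* 3-byte sequence */
        else                         { i += 4; units += 2; } /* 4-byte: surrogate pair */
    }
    return units;
}

In a byte-at-a-time reader the same rule works incrementally: add 2 instead of 1 for each 4-byte sequence and stop once the running count reaches the SIZE the client sent.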

Nico
  • 324
  • 2
  • 2
  • OK, but what am I to do if `length` is useless? Should I introduce some terminators to the protocol instead? How to escape them then to be binary-safe? – Alexander Gladysh Feb 16 '12 at 02:53
  • Can you do anything to the client? Or must you rely on what it says? – Nico Feb 16 '12 at 05:07
  • What do you mean "do anything"? That's "my" protocol, so the client would have to adhere to it. But the protocol must be implementable for a JS client, with good cross-browser support. – Alexander Gladysh Feb 16 '12 at 09:37
  • Well, if it's not a binary protocol then you don't need to worry about length. Just learn to parse JSON in the server... It's better that way. – Nico Feb 16 '12 at 22:49
  • No JSON, sorry. Let's say that I'm teaching myself manual protocol design tricks, so that we wouldn't argue about why JSON is not applicable here. – Alexander Gladysh Feb 16 '12 at 23:33
  • (BTW, after some thought, I came to the conclusion that my parsing code also works with codepoints, strangely enough.) – Alexander Gladysh Feb 16 '12 at 23:35
  • Well, good luck then. Protocol design is not easy. You've already run into the problem of how to count things on the wire. Hopefully you don't end up with buffer overflows. – Nico Feb 17 '12 at 00:31
1

Characters? Or codepoints? The two are not the same. Unicode is... complex. You could count all of these different things about a UTF-8 string: length in bytes, length in codepoints, length in characters, length in glyphs, and length in grapheme clusters. All of those might come out different for any given string!

My first inclination is to tell that broken client to go away. But assuming you can't do that, you need to ask what exactly the client is counting. The simplest thing to count, after bytes, is codepoints -- that's what UTF-8 encodes, after all. After that? Characters, but you need to have tables of combining codepoints so that you can identify sequences of codepoints that make up a character. If the client counts glyphs or grapheme clusters then you're in for a world of hurt. But most likely the client counts either codepoints or characters. If it counts codepoints, then just count the bytes that do not match the continuation pattern 10xxxxxx, i.e. every byte of the form 0xxxxxxx or 11xxxxxx (though you probably want to implement enough UTF-8 validation to protect against overlong sequences). If it counts characters, then you need to identify combining marks and count them as part of the associated non-combining codepoint.
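
A minimal sketch of that byte-level codepoint count, assuming the buffer is valid UTF-8 (the function name is made up here; overlong-sequence checks are omitted, as warned above):

#include <stddef.h>

/* Count codepoints in a UTF-8 buffer: every byte that is NOT a
 * continuation byte (10xxxxxx) starts a new codepoint. */
static size_t count_codepoints(const char *buf, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; ++i)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            ++count;
    return count;
}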

Nico
  • 324
  • 2
  • 2
0

This looks like exactly the thing I'd need. I wish I had found it earlier:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Alexander Gladysh
  • 39,865
  • 32
  • 103
  • 160
0

If the length you get doesn't match the number of bytes you get, you have a couple of choices.

  1. Read one byte at a time and assemble them into characters until you get the matching number of characters.

  2. Add a known terminator and skip the string size entirely. Just read one byte at a time until you read the terminator sequence.

  3. Read the number of bytes listed in the header (since that's the minimum possible size). Figure out whether you have enough characters. If not, read some more! (A sketch of this approach follows below.)
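
A rough C89 sketch of option 3, assuming <SIZE> counts codepoints, the incoming data is valid UTF-8, and the stream is a FILE * as in the question (the function name and the simplified error handling are mine):

#include <stdio.h>
#include <stdlib.h>

/* Read SIZE bytes up front (the minimum possible length of SIZE
 * codepoints), then keep reading whole characters until the buffer
 * holds SIZE codepoints.  Returns a malloc'd, NUL-terminated buffer
 * and stores the byte length in *out_len, or NULL on error. */
static char *read_n_codepoints(FILE *in, size_t nchars, size_t *out_len)
{
    size_t cap = nchars * 4 + 1;      /* 4 bytes per codepoint worst case */
    char *buf = malloc(cap);
    size_t len, have, i;

    if (buf == NULL)
        return NULL;

    /* 1. Bulk read: nchars codepoints occupy at least nchars bytes. */
    if (fread(buf, 1, nchars, in) != nchars)
        goto fail;
    len = nchars;

    /* 2. The bulk read may have stopped mid-sequence; finish the last
     * character by consuming any continuation bytes that follow. */
    for (;;) {
        int c = fgetc(in);
        if (c == EOF)
            break;
        if ((c & 0xC0) != 0x80) {     /* next character starts here */
            ungetc(c, in);
            break;
        }
        buf[len++] = (char)c;
    }

    /* 3. Count the codepoints we already have (non-continuation bytes). */
    have = 0;
    for (i = 0; i < len; ++i)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            ++have;

    /* 4. Top up with whole characters until we have enough. */
    while (have < nchars) {
        int c = fgetc(in);
        int extra;

        if (c == EOF)
            goto fail;
        if ((c & 0x80) == 0x00)      extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else
            goto fail;

        buf[len++] = (char)c;
        while (extra-- > 0) {
            c = fgetc(in);
            if (c == EOF || (c & 0xC0) != 0x80)
                goto fail;
            buf[len++] = (char)c;
        }
        ++have;
    }

    buf[len] = '\0';
    *out_len = len;
    return buf;

fail:
    free(buf);
    return NULL;
}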

Carl Norum
  • 219,201
  • 40
  • 422
  • 469
0

If the DATA can't contain a CRLF, it seems that you could use the CRLF as a framing delimiter. Just ignore the SIZE and read until CRLF.
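
A minimal sketch of that delimiter-based reading, assuming a FILE * as in the question (the function name and the buffer-growth strategy are mine):

#include <stdio.h>
#include <stdlib.h>

/* Read bytes from `in` up to and including a CRLF pair, strip the
 * CRLF, and return a malloc'd, NUL-terminated buffer (NULL on EOF or
 * allocation failure).  Only usable if DATA itself never contains
 * CRLF, as noted above. */
static char *read_until_crlf(FILE *in)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = fgetc(in)) != EOF) {
        if (c == '\n' && len > 0 && buf[len - 1] == '\r') {
            buf[len - 1] = '\0';      /* drop the '\r' and terminate */
            return buf;
        }
        if (len + 1 >= cap) {         /* grow the buffer as needed */
            char *tmp = realloc(buf, cap *= 2);
            if (tmp == NULL)
                goto fail;
            buf = tmp;
        }
        buf[len++] = (char)c;
    }

fail:
    free(buf);
    return NULL;                      /* EOF before CRLF */
}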

paleozogt
  • 6,393
  • 11
  • 51
  • 94