Parsing a UTF-8 string from a server response

Question

I implemented an app on a device that sends data to and receives data from a server. Data from the server usually comes in this form:

"1;username;someInteger;"

Parsing was easy: as you can imagine, I used strtok to retrieve the individual values from that string, such as 1, username, and someInteger.

But now the server may send me a Unicode string as the username.

I think a good idea is to have the username encoded as a UTF-8 string (am I right?). What do you recommend: how should I parse it from the above string? For example, which symbol should I use as a separator (instead of ";"), or which functions should I use to extract the username from the above string?

As this is an embedded device, I want to avoid installing third-party libraries there (which might not even be possible), so more "pure" approaches would be preferable.

Avoid `strtok`. It’s not thread-safe. Use `boost::split` instead. – , Oct 30 '13 at 08:52
@rightfold Avoid `boost` for a simple substitution of `strtok()`. It's too big. Use `strtok_r()` instead. – , Oct 30 '13 at 08:52
@H2CO3: Yes, as I mentioned, this is an embedded device. I'm still trying to avoid installing large third-party libraries there (I'm not even sure that's possible). – , Oct 30 '13 at 08:53
@H2CO3 Have you looked at the `strtok` source code, or the binary code generated for it? Is it "small" compared to `boost::split`? – Abyx, Oct 30 '13 at 09:54
@Abyx Where are all my comments? As to your question: [here](http://www.opensource.apple.com/source/Libc/Libc-167/string.subproj/strtok.c) is an implementation of `strtok()`, and here is [`iter_split()`](https://github.com/boostorg/algorithm/blob/master/include/boost/algorithm/string/iter_find.hpp) that `boost::algorithm::string::split()` uses. Altogether, `strtok()` is fewer SLOC than the boost thingy, and it has the added advantages that it 1. is standard, 2. doesn't require inclusion of huge headers, and 3. works in C too. – , Oct 30 '13 at 10:21
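For readers who want neither `strtok()`'s hidden static state nor a dependency on `strtok_r()` (which is POSIX rather than ISO C), a hand-rolled splitter built on `strchr()` is one possible alternative. The following is a minimal sketch, not something proposed in the thread: the name `split_fields` and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (name and signature are illustrative only).
 * Splits `buf` in place on the single-byte separator `sep` and stores a
 * pointer to each field in `out`. Returns the number of fields stored,
 * at most `max_fields`; anything after that many fields is dropped.
 * It keeps no static state, so it is reentrant, and unlike strtok() it
 * reports empty fields (e.g. the one after a trailing separator).
 */
static size_t split_fields(char *buf, char sep, char **out, size_t max_fields)
{
    size_t n = 0;

    assert(buf != NULL && out != NULL);
    assert(sep != '\0'); /* strchr() would match the terminator itself */

    while (n < max_fields) {
        char *end = strchr(buf, sep);

        out[n++] = buf;
        if (end == NULL)
            break;          /* last field: no further separator */
        *end = '\0';        /* terminate this field in place */
        buf = end + 1;      /* continue after the separator */
    }
    return n;
}
```

Because the separator is an ASCII byte, this works unchanged on UTF-8 input. For the message format in the question, the trailing `;` produces a final empty field, so a caller would see four fields rather than three.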

score 4 · Accepted Answer · answered Oct 30 '13 at 08:52

The character ';' is the same in UTF-8 as it is in ASCII, because the first 128 code points (0 through 127) are encoded identically in both. That means you can still use strtok to split on the ';'.
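As an illustration only (not part of the original answer), here is a minimal sketch of parsing the `"1;username;someInteger;"` format with `strtok` when the username is UTF-8. The struct `response`, its field names, and the function `parse_response` are hypothetical, and the meaning of the first field is assumed to be a numeric code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the three fields of the message. */
struct response {
    long code;             /* first field; its meaning is an assumption */
    const char *username;  /* UTF-8 bytes, points into the message buffer */
    long value;            /* the "someInteger" field */
};

/*
 * Parses `msg` in place (strtok overwrites each ';' with '\0').
 * Returns 0 on success, -1 if a field is missing.
 */
static int parse_response(char *msg, struct response *out)
{
    char *tok;

    assert(msg != NULL && out != NULL);

    tok = strtok(msg, ";");
    if (tok == NULL)
        return -1;
    out->code = strtol(tok, NULL, 10);

    tok = strtok(NULL, ";");
    if (tok == NULL)
        return -1;
    out->username = tok;   /* no decoding needed just to extract it */

    tok = strtok(NULL, ";");
    if (tok == NULL)
        return -1;
    out->value = strtol(tok, NULL, 10);

    return 0;
}
```

Two caveats apply to this sketch: `strtok` modifies the buffer it is given, so the message must be in writable memory, and it treats consecutive separators as one, so an empty username would shift the remaining fields.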

Some programmer dude

I have heard strtok may "stop" if it encounters a null terminating character along the way. Couldn't the Unicode string contain some bytes between two separators (;) that strtok would interpret as a null terminator? – Oct 30 '13 at 08:56
@dmcr_code No. Multibyte sequences contain only bytes with values >= 128 (or < 0, depending on char signedness), so any ASCII byte *is* an ASCII code point. There is no code point other than the null character whose encoding contains a null byte (see http://stackoverflow.com/questions/6907297/can-utf-8-contain-zero-byte). In other words: if strtok encounters a null byte, it *is* the string terminator and nothing else. The same applies to any other ASCII value: – Arne Mertz Oct 30 '13 at 09:00
meaning: the `;` can't be found as part of a multibyte sequence either, so you won't get false positives. – Arne Mertz Oct 30 '13 at 09:07
@Arne Mertz: OK, I got it. I was just wondering whether the string might contain some non-ASCII character whose code value (code point) was, for instance, 45 00; strtok would then interpret the last byte of this symbol as a null terminator, right? (But I think you said this can't be the case.) – Oct 30 '13 at 09:09
@dmcr_code Yes, that can't happen. Neither 45 nor 00 can be part of a multibyte sequence (non-ASCII code point); such sequences contain only byte values >= 80 (hex). – Arne Mertz Oct 30 '13 at 09:15
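To make the point above concrete, here is a minimal sketch of a UTF-8 encoder (the function name `utf8_encode` is hypothetical and not from the thread). Every byte it emits for a code point of 0x80 or above has the high bit set, because lead bytes start with `11` and continuation bytes with `10`. The code point U+4500, which resembles the "45 00" example, comes out as E4 94 80 rather than 45 00.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encodes code point `cp` as UTF-8 into `out`
 * (which must have room for 4 bytes). Returns the number of bytes
 * written (1 to 4), or 0 if `cp` is not a valid Unicode scalar value.
 */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Since the smallest byte in any multibyte sequence is 0x80, neither 0x00 nor 0x3B (`;`) can appear inside one.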

score 0 · Answer 2 · answered Oct 30 '13 at 08:51

The nice thing about UTF-8 is that you hardly have to do anything at all. ASCII characters still encode as the same ASCII bytes they always did, so if you just keep using semicolon separators, nothing in your parsing needs to change.

Dolda2000

I think strtok may stop if it encounters a null terminating character along the way. Couldn't the Unicode string contain some bytes between two separators (;) that strtok would interpret as a null terminator? – Oct 30 '13 at 08:58
Nope: the only "null terminating character" is ASCII 0, aka NUL. No UTF-8 encoding contains any NUL bytes except the encoding of NUL itself, which, again, is just as it would be in normal ASCII. – Dolda2000 Oct 31 '13 at 05:02
