Split a string that holds characters in different sizes

Question

I have an input string that holds characters of different sizes, for example const char * input = "aadđ€€¢¢". strlen gives 15, which means that while 'aad' takes only 3 bytes, each of the other special characters takes 2 or more bytes.

How can I cut the characters that fit into 6 bytes from the start of that string? In this case only 'aadđ' should be taken, because 'aadđ€' would occupy 8 bytes.

I tried the usual string-splitting methods, but none has worked so far. Edit: The problem is that a multibyte character might get split in the middle, so I get garbage or a different character instead.

You're using a C-style string. Is your question then about the C language? C or C++: choose *one*. — John Bollinger, May 30 '19 at 12:10
Hi, I would gladly accept the solution in C++ as well. But I'm working with a restricted environment so only standard libraries are accepted. — An Phong, May 30 '19 at 12:32
Which encoding are you using? Also what language, not all languages can be split using standard library functions. This is particularly true of any language not using the latin alphabet. — Mgetz, May 30 '19 at 12:37
What you need is a way of extracting character sets that represent a potentially multibyte sequence. — Fureeish, May 30 '19 at 12:39
Related assuming you can use a space to split https://stackoverflow.com/q/236129/332733 — Mgetz, May 30 '19 at 12:43
The encoding question is about your compilation's "-fexec-charset" or equiv. ("-fsource-charset" being right is not an information question; It's just a fundamental requirement.) — Tom Blodget, May 30 '19 at 16:59
Thank you, guys. The input is cast from a smartphone to another device, so I was confused about how to handle it. The encoding is UTF-8, so I will read more about it to solve this problem. — An Phong, May 31 '19 at 02:51
I voted to reopen, because this question is clearly not about debugging. It also has a clear problem statement: "How can I cut characters that fit into 6 bytes from the start of that string?" — Olaf Dietsche, Jun 01 '19 at 11:36

score 2 · Answer 1 · edited May 30 '19 at 17:33

You need to understand the difference between "bytes" and "characters".

A byte is the smallest addressable unit of computer storage and holds 8 bits of information. A character (a Unicode code point, to be exact) is a number from 0 to 0x10FFFF that is represented by one or more bytes, depending on the encoding in use. A character is associated with a "glyph", a picture that is part of various fonts.

In UTF-8, the characters with codes 0 through 127 (usually called "ASCII characters", but technically called the "C0 Controls and Basic Latin" block) are encoded in one byte each. Those include English letters, digits and some punctuation. All other characters are encoded in two to four bytes. Please look up UTF-8 and UTF-16 for some examples of how the encoding is done.
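As a rough illustration of the byte/character distinction in UTF-8, here is a minimal sketch that reads the sequence length off the first byte and counts code points. The helper names utf8_sequence_length and utf8_code_point_count are made up for this example, and the code assumes the input is already valid UTF-8 (it does not reject overlong or otherwise malformed sequences).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
// Does not validate the bytes that follow.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: one-byte (ASCII) character
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: lead byte of 2-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: lead byte of 3-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: lead byte of 4-byte sequence
    return 0;                             // 10xxxxxx continuation byte, or invalid
}

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes `s` is valid UTF-8.
std::size_t utf8_code_point_count(const char *s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the string in the question, strlen reports 15 bytes, while this count comes out at 8 code points.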

To answer your question: given the string in your example, you can cut 6 bytes from the beginning of the string, but the last bytes may not form a valid character. In UTF-8, the sixth byte here is a "prefix" (lead) byte that must be followed by one to three more bytes to form a complete code point.
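One way to avoid the dangling prefix byte, using only the standard library, is to cut at the byte limit and then step back while the first dropped byte is a continuation byte. The following is a minimal sketch under the assumption that the input is valid UTF-8; utf8_prefix is a hypothetical name, and the function works on code points, not on composed glyphs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: longest prefix of `s` that is at most `max_bytes`
// long and does not end in the middle of a UTF-8 sequence.
// Assumes `s` is valid UTF-8.
std::string utf8_prefix(const std::string &s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;  // everything fits
    std::size_t end = max_bytes;
    // s[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut lies inside a sequence,
    // so move back until it sits on that sequence's lead byte.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    assert(end <= max_bytes);
    return s.substr(0, end);
}
```

For the string in the question, utf8_prefix(input, 6) returns the 5 bytes of "aadđ", because the sixth byte starts a 3-byte sequence that would not fit.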

Khouri Giordano

answered May 30 '19 at 12:51

UTF-8, UTF-16 and UTF-32 are encodings of the same Unicode **code points**. Due to composition, it might take more than one code point to represent one **glyph**, what you would recognize as a character. Figuring that last part out is a job for a library like ICU. For most simple purposes, you can check the byte length for UTF-8 code points pretty easily. Read about the encoding on Wikipedia. – Khouri Giordano May 30 '19 at 13:20
@KhouriGiordano You are right, I completely forgot about the composition. I am not sure if adding a whole other level of abstraction would improve the OP's understanding :) . I'll change the answer into Community Wiki, please feel free to edit. Or provide your own, of course. – May 30 '19 at 16:28
'Usually called "ASCII characters" but it's not quite right name': Right, in the context of the Unicode character set, they are the [C0 Controls and Basic Latin](http://www.unicode.org/charts/nameslist/index.html) block. – Tom Blodget May 30 '19 at 17:02
Thank you. I will look up UTF-8 to understand how the encoding is done to solve this. – An Phong May 31 '19 at 02:52

Olaf Dietsche · Accepted Answer · 2019-06-02T14:46:04.793

strlen counts bytes, not characters. To step through the string character by character, you might try mblen, which returns the number of bytes in the next multibyte character of a string. It interprets the bytes according to the current locale, so if the string's encoding is not UTF-8, you must adjust the call to setlocale accordingly.

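// Needs <clocale>, <cstdlib> and <iostream>.
// The locale name is system-dependent; setlocale returns a null pointer
// if the requested locale is not available.
std::setlocale(LC_ALL, "en_US.utf8");
const char *input = "aadđ€€¢¢";
int clen;
mblen(0, 0);  // reset mblen's internal shift state
for (const char *p = input; *p != 0; p += clen) {
    clen = mblen(p, 4);  // a UTF-8 character occupies at most 4 bytes
    if (clen <= 0)       // -1: invalid or incomplete sequence
        break;
    std::cout << p << ", clen=" << clen << '\n';
}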

Getting exactly 6 bytes might prove difficult, because the cut might land in the middle of a multi-byte character. Advance one whole character at a time instead:

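int len = 0, clen = 0;
mblen(0, 0);  // reset mblen's internal shift state
for (const char *p = input; *p != 0 && len < 6; p += clen, len += clen) {
    clen = mblen(p, 4);
    if (clen <= 0)  // stop at an invalid sequence instead of looping forever
        break;
}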

char buf[10];
strncpy(buf, input, len);
buf[len] = 0;

This stops as soon as 6 or more bytes are reached, so len can end up larger than 6 (for the example input in a UTF-8 locale it ends at 8).

To get at most 6 bytes, subtract the length of the last character before copying if there is an overrun:

if (len > 6)
    len -= clen;
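Put together, the steps above can be sketched as a single function. This is only an illustration: prefix_by_mblen is a made-up name, the function checks the limit before adding each character instead of subtracting afterwards, and it assumes the caller has already selected a locale whose encoding matches the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: longest prefix of `input` that fits in `max_bytes`
// without splitting a multibyte character, as decoded by mblen in the
// current LC_CTYPE locale. Stops early at bytes the locale cannot decode.
std::string prefix_by_mblen(const char *input, std::size_t max_bytes) {
    assert(input != nullptr);
    const std::size_t total = std::strlen(input);
    std::size_t len = 0;
    std::mblen(nullptr, 0);  // reset mblen's internal shift state
    while (len < total) {
        const int clen = std::mblen(input + len, total - len);
        if (clen <= 0) break;  // -1: invalid sequence, 0: null character
        if (len + static_cast<std::size_t>(clen) > max_bytes) break;  // would overrun
        len += static_cast<std::size_t>(clen);
    }
    return std::string(input, len);
}
```

After a successful std::setlocale(LC_ALL, "en_US.utf8"), prefix_by_mblen(input, 6) should yield "aadđ" for the string in the question. If setlocale returns a null pointer, the locale is not installed and the non-ASCII bytes may not be decoded as intended.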

Thank you Olaf. Your solution solved my problem. – An Phong May 31 '19 at 03:41

score 0 · Answer 3 · answered May 30 '19 at 12:17

I can't tell what your problem is, since you didn't describe the issue you encountered, but this should work. The only problem may be that a multibyte character might get split in the middle, so you end up with a different char.

char input2[7] = {0};
memcpy(input2, input, 6);

If you want to get the wchar len you can use wcslen()

http://www.cplusplus.com/reference/cwchar/wcslen/

Shlomi Agiv

Thank you. You are right. The problem I'm having is that a wide character might get split in the middle and I have a different char to display. – An Phong May 30 '19 at 12:27
You can't use `wcslen()` here. Wide characters are not a miracle solver for multibyte encodings. – Fureeish May 30 '19 at 12:38
