substr with characters instead of bytes

Question

Suppose I have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"

When I call s.substr(30,40) it returns " #Person Tätigkeitsdarstellung", beginning with a space. I guess it's counting bytes instead of characters.

Normally the size of the string is 110 and when I do a s.length() or s.size() it returns 113 because of the 3 special characters.

I was wondering if there is a way to avoid this empty space at the beginning of the return value.

Thanks for your help!

Not bytes, well not directly anyway, but `char`. If you have a variable-length encoding of the string stored in the `std::string`, you have to handle it yourself. — Some programmer dude, Aug 04 '14 at 10:10
`Normally the size of the string is 110 and when I do a s.length() or s.size() it returns 113 because of the 3 special characters.` Then that means that the string is *not* 110 characters, but 113 characters. The size() function doesn't lie. Also, what are these "special characters"? Carriage returns, control characters, ...? — PaulMcKenzie, Aug 04 '14 at 10:12
s.length() gives me 110; then I add one more ä and it gives me 111, one more # and it gives me 112, one more & and, surprise surprise, it returns 113. Can you give some information about your machine and IDE? Maybe they are causing the problem. If you want, I can write a function that does the same job as s.substr() if yours doesn't work. — oknsnl, Aug 04 '14 at 10:21
@PaulMcKenzie what I meant for special characters were German characters. — zuubs, Aug 04 '14 at 10:40
I guess I expressed myself wrongly about the size of the actual string. The number of characters counted in the string 's' is 110 and s.size() gives 113. — zuubs, Aug 04 '14 at 10:44
@zuubs, I think that the correct terminology might here be "110 code points" — eerorika, Aug 04 '14 at 10:48
I searched but couldn't find anything, but I can guarantee that Win7 and VS10 count correctly, so the problem probably comes from your machine or IDE. I suggest trying g++ or something similar; it may be a special case with Eclipse or Ubuntu. — oknsnl, Aug 04 '14 at 11:20
@PaulMcKenzie The size function doesn't lie, but it doesn't return the number of characters in the string either; only the number of `char`. — James Kanze, Aug 04 '14 at 11:46

eerorika · Answer 1 · 2014-08-04T15:01:18.910

In UTF-8, the code point (character) ä consists of two code units (which are 1 byte each in UTF-8). C++ does not have support for treating strings as a sequence of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
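
To make the code unit versus code point distinction concrete, here is a minimal sketch that counts code points in a std::string. The name utf8_length is made up for this example, and the function assumes the string holds valid UTF-8 (it performs no validation). It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Hypothetical helper; assumes `s` is valid UTF-8 and does not validate it.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // not a continuation byte: starts a code point
            ++count;
        }
    }
    return count;
}
```

With the two bytes of ä ("\xC3\xA4") as input, size() reports 2 while utf8_length reports 1.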

A simple approach is to use std::wstring. wstring uses a character type (wchar_t) which is at least as wide as the widest character set supported by the system. Therefore, if the system supports a wide enough encoding to represent any (non-composite) Unicode character with a single code unit, then string methods will behave as you would expect. Currently UTF-32 is wide enough and is supported by (most?) Unix-like OSes.

A thing to note is that on Windows wchar_t is 16 bits wide and holds UTF-16 code units, not UTF-32. So if you choose the wstring approach and port your program to Windows, and a user of your program uses Unicode characters outside the Basic Multilingual Plane (which need two UTF-16 code units), then the presumption of one code unit per code point does not hold.
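
The UTF-16 caveat can be illustrated with char16_t strings, which behave the same on every platform. This is only a sketch: utf16_length is a hypothetical helper, and it assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string by skipping low (trailing) surrogates,
// which occupy the range 0xDC00..0xDFFF.
// Hypothetical helper; assumes well-formed UTF-16.
std::size_t utf16_length(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For example, U+1F600 lies outside the Basic Multilingual Plane, so std::u16string(u"\U0001F600").size() is 2 while utf16_length gives 1. For ä (U+00E4) both give 1.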

The wstring approach also doesn't take control or composite characters into consideration.
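
A small sketch of the composite character issue, using UTF-32 so that code units and code points coincide. The variable names are made up for the example. The same visible character ä can be stored either as one precomposed code point or as a base letter followed by a combining mark.

```cpp
#include <string>

// Precomposed form: U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).
const std::u32string precomposed = U"\u00E4";

// Decomposed form: U+0061 'a' followed by U+0308 (COMBINING DIAERESIS).
// It renders as the same character but is two code points long.
const std::u32string decomposed = U"a\u0308";
```

Here precomposed.size() is 1 and decomposed.size() is 2, and decomposed.substr(0, 1) yields a bare "a" without its diaeresis. Handling this properly needs normalization or grapheme segmentation, which is where a library such as ICU comes in.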

Here's a little test code which takes a std::string containing the multi-byte UTF-8 character ä and converts it to a wstring:

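// needs <codecvt>, <locale>, <string>, <iostream> and "using namespace std;"
string foo("ä"); // read however you want; must be UTF-8 encoded
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo); // pass the string itself, not foo.data()
cout << foo.size() << endl; // 2 on my system
cout << wfoo.size() << endl; // 1 on my system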

Unfortunately, as of gcc-4.8 at least, libstdc++ hasn't implemented <codecvt>, which was introduced in C++11. If you can't require libc++, then similar functionality is probably in Boost.Locale.
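
Where <codecvt> is not available, the conversion can also be written by hand. The following is a rough sketch, not a replacement for a tested library: utf8_to_utf32 is a hypothetical name, it targets std::u32string rather than wstring so the result does not depend on the platform's wchar_t, and it assumes valid UTF-8 (truncated, overlong or stray bytes are not detected).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Hypothetical helper; assumes valid UTF-8 input and performs no validation.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

After decoding, u32string::substr and size() operate on code points, so utf8_to_utf32("\xC3\xA4").size() is 1.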

Alternatively, if you wish to keep your code portable to systems that don't support utf-32, you can keep using std::string and use an external library for iterating and counting and such. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
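
For simple cases, the iterate-and-count idea can be sketched without any external library. The helpers below are hand-rolled and hypothetical (utf8_advance and utf8_substr are not part of utfcpp or ICU). They assume valid UTF-8 and count code points, so composite characters still span several positions.

```cpp
#include <cstddef>
#include <string>

// Move forward `n` code points from `byte_pos`, which must lie on a code
// point boundary. Returns the new byte offset, clamped to s.size().
// Hypothetical helper; assumes valid UTF-8.
std::size_t utf8_advance(const std::string& s, std::size_t byte_pos, std::size_t n) {
    while (n > 0 && byte_pos < s.size()) {
        ++byte_pos;  // step over the lead byte
        // Step over any continuation bytes (10xxxxxx) of the same code point.
        while (byte_pos < s.size() &&
               (static_cast<unsigned char>(s[byte_pos]) & 0xC0) == 0x80) {
            ++byte_pos;
        }
        --n;
    }
    return byte_pos;
}

// Like std::string::substr, but `pos` and `count` are measured in code points.
// A `pos` past the end yields an empty string instead of throwing.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t count) {
    std::size_t begin = utf8_advance(s, 0, pos);
    std::size_t end = utf8_advance(s, begin, count);
    return s.substr(begin, end - begin);
}
```

For the UTF-8 string "aäböc", utf8_substr(s, 1, 3) returns "äbö", whereas s.substr(1, 3) returns the two bytes of ä followed by "b".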

Not all characters can be represented by a single code point in Unicode. Depending on what he is doing (languages, etc.), he may be able to just use UTF-16, ignoring composite characters or characters outside of the basic encoding plane, or he may have to handle composite characters even in UTF-32. (With regards the UTF-16 vs. UTF-32: Windows and AIX are UTF-16, as is Java. Most other Unices are UTF-32, although there is likely still support floating around for earlier wide character encodings.) — James Kanze, Aug 04 '14 at 11:43
He might also want to look at ICU, which is reportedly very complete. — James Kanze, Aug 04 '14 at 11:45
@JamesKanze, I just wrote an edit mentioning the problem with compound characters even with utf-32 at the same time with your comment. — eerorika, Aug 04 '14 at 11:51
@user2079303 I am reading strings from a file with getline(basic_istream&, string&); and I have read that wstring is too complex to use on Linux! — zuubs, Aug 04 '14 at 12:35
@user2079303 thanks for the useful information. I have tried the utf8 external library; it gives me the correct length for the string, but I can't find a way to get a correctly encoded string that I could use substr with. Also, I couldn't use the portion of code above for converting to wstring; make complains that "‘wstring_convert’ was not declared in this scope". wstring works when I do something like: wstring str = L"mystring", but I'm trying to convert a string variable. — zuubs, Aug 05 '14 at 11:28
