
I need to get a substring of the first N characters in a std::string assumed to be utf8. I learned the hard way that .substr does not work... as... expected.

Reference: My strings probably look like this: mission:\n\n1億2千万匹

dyp
Jonny
  • The problem is that UTF-8 is a variable-length encoding: each character can be one to four bytes. While you can use `std::string` to store UTF-8 strings, you can't use the standard functions straight off. You *can* use the `substr` function, but you have to use some special code to find the actual start and end of the substring. Unless you're worried about space, you might want to store strings in a fixed-length encoding internally, like UTF-32. – Some programmer dude Jun 23 '15 at 06:26
  • Like [this](http://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11) link says: "Unicode is not supported by Standard Library (for any reasonable meaning of supported). std::string is no better than std::vector: it is completely oblivious to Unicode (or any other representation/encoding) and simply treat its content as a blob of bytes." – paulsm4 Jun 23 '15 at 06:45
  • Even with UTF-32, you could slice off combining characters (e.g. accents) unintentionally. If you really need this, I'd consider ICU (http://site.icu-project.org) or some similar library tailored to handling Unicode in all its glory. – Ulrich Eckhardt Jun 23 '15 at 06:46
  • If you look at the [description of utf-8 on wikipedia](https://en.wikipedia.org/?title=UTF-8#Description), you'll see the first byte in each character encoding tells you how many total bytes that logical character contains. It's then easy to skip to the next character. You can use this approach to move into the string, and advance a desired number of utf-8 characters, passing the offsets to `substr()`. I don't know what your other utf-8 related needs are, but for this alone it seems unnecessary to find a utf-8 library. – Tony Delroy Jun 23 '15 at 06:49
  • `std::string` is used for ASCII (single-byte) strings *only*. No encodings are assumed by C++ or the standard library. If you want to use Unicode (specifically the fixed-width subset of UTF-16) use `std::wstring`. UTF-8 is essentially a binary encoding of a string that needs to be decoded before you can treat it as a string in any language. – Panagiotis Kanavos Jun 23 '15 at 07:31
  • Thanks everyone. I think std::string could be typedef'd to std::abunchofbytes – Jonny Jun 23 '15 at 07:34
  • @Jonny No, `std::string` is *very* specific. It's an ASCII string, which is *more* than just a `char` array. Just use the *correct* type for Unicode strings - `wstring`. Just like `string`, `wstring` is *more* than just a `wchar_t` array. UTF-8 is *not* a string, it's an encoded blob that must be decoded to actual strings before it can be used – Panagiotis Kanavos Jun 23 '15 at 07:37
  • @PanagiotisKanavos: Please stop saying that `std::string` is in any way related to ASCII, it's incorrect. As a rule of thumb, if you don't know what an encoding is, or the difference between character, glyph, codepoint and grapheme, you have no business slicing unicode strings; just pass it over to a library written by someone who does (like ICU). – DanielKO Jun 23 '15 at 07:49
  • @DanielKO working in a non-English country, I deal with Unicode exclusively. I do know of *all* these things, even remember the pre-Win95 days when apps had to use custom digrams to display non-Latin characters. If I wanted to use Unicode today, I'd look to u16string or u32string and check compiler support for them. I *wouldn't* use `string` because it would always be too small for my data – Panagiotis Kanavos Jun 23 '15 at 08:17
  • @PanagiotisKanavos If you know better, please stop posting bad advice. Converting to `wstring` of 16-bit characters will only make the bug less common, it won't fix anything. Greek might fit within UCS-2 (I honestly do not know), but Chinese *does not* -- the OP is explicitly working with Chinese. And `std::string` is not ASCII. UTF-8 encoded text *is* a string. Expecting your Unicode not to have multi-byte characters (or multi-wchar_t characters) is asking for bugs to occur. And working in UCS-4/UTF-32 is rarely worth it. – Yakk - Adam Nevraumont Jun 23 '15 at 14:32
  • @Jonny: Can you please explain **why** you need the first N characters? It makes no linguistic sense, [especially in Unicode](http://utf8everywhere.org/#faq.glossary). Also do you want to count **codepoints** or **graphemes**? – Yakov Galka Jun 25 '15 at 12:21
  • Good question. I coded an animation of someone typing character for character, like most of us do. I hope that makes sense, linguistically. – Jonny Jun 25 '15 at 12:33
  • This worked for me: http://stackoverflow.com/a/11946973/264619 – ftvs Aug 11 '15 at 12:00
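To make the byte-counting approach from the comments concrete, here is a minimal sketch (the helper names `utf8_seq_len` and `utf8_offset_of` are mine, not from any library):

```cpp
#include <string>

// Number of bytes in the UTF-8 sequence introduced by lead byte `c`,
// or 0 if `c` is a continuation byte or invalid as a lead byte.
inline int utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // not a valid lead byte
}

// Byte offset of the code point with index `n` (0-based).
// Returns npos on malformed input; clamps to the end of the string.
inline std::string::size_type utf8_offset_of(const std::string& s,
                                             std::string::size_type n)
{
    std::string::size_type i = 0;
    while (n-- > 0 && i < s.size()) {
        int len = utf8_seq_len((unsigned char)s[i]);
        if (len == 0) return std::string::npos; // malformed UTF-8
        i += len;
    }
    return i <= s.size() ? i : std::string::npos;
}
```

With this, the first N characters are simply `s.substr(0, utf8_offset_of(s, N))`.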

4 Answers


I found this code and am just about to try it out.

std::string utf8_substr(const std::string& str, std::string::size_type start, std::string::size_type leng)
{
    if (leng == 0) { return ""; }
    std::string::size_type i, ix, q;
    std::string::size_type min = std::string::npos, max = std::string::npos;
    for (q = 0, i = 0, ix = str.length(); i < ix; i++, q++)
    {
        if (q == start) { min = i; }
        if (q <= start + leng || leng == std::string::npos) { max = i; }

        unsigned char c = (unsigned char) str[i];
        if      (c <= 0x7F)          i += 0; // single-byte (ASCII) character
        else if ((c & 0xE0) == 0xC0) i += 1; // two-byte sequence
        else if ((c & 0xF0) == 0xE0) i += 2; // three-byte sequence
        else if ((c & 0xF8) == 0xF0) i += 3; // four-byte sequence (5- and 6-byte
                                             // sequences do not occur in valid UTF-8)
        else return "";                      // invalid UTF-8 lead byte
    }
    if (q <= start + leng || leng == std::string::npos) { max = i; }
    if (min == std::string::npos || max == std::string::npos) { return ""; }
    return str.substr(min, max - min); // substr takes (pos, count), not (pos, end)
}

Update: This worked well for my current issue. I had to combine it with a function that returns the length, in characters rather than bytes, of a UTF-8 encoded std::string.
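That length function can be as simple as counting the bytes that are not UTF-8 continuation bytes; a sketch (the name `utf8_length` is my own choice):

```cpp
#include <string>

// Length in code points of a UTF-8 encoded std::string. Continuation
// bytes have the form 10xxxxxx, so count every byte that is not one.
inline std::string::size_type utf8_length(const std::string& str)
{
    std::string::size_type len = 0;
    for (std::string::size_type i = 0; i < str.size(); ++i)
        if (((unsigned char)str[i] & 0xC0) != 0x80)
            ++len;
    return len;
}
```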

My compiler also spat out some warnings for this code.

Jonny
  • C++ already has a 2-byte string, `std::wstring`, which is supported by algorithms. It's better to convert UTF-8 content to Unicode and `wstring` while reading, rather than rewriting every algorithm to handle "magic" strings that use the ASCII string type (std::string) but behave as something else – Panagiotis Kanavos Jun 23 '15 at 07:32
  • @PanagiotisKanavos: `std::wstring` is not 2-byte. Please read http://utf8everywhere.org/ – DanielKO Jun 23 '15 at 07:51
  • @DanielKO C++ references are preferable and yes, wchar_t is 2 or more bytes dependent on implementation - so char16_t or char32_t are preferable. I see things are in flux again and we have Unicode literals that are mapped to `char16_t*` or `char32_t`. There are also UTF8 encoded literals that map to `char*`! There's also u16string and u32string. Don't know about STL support for them - who moved my cheese! – Panagiotis Kanavos Jun 23 '15 at 08:18
  • @PanagiotisKanavos If you convert to `wstring` and expect every character to fit in a `wchar_t`, you are going to have harder to find bugs, but no fewer bugs. Unicode characters do not all fit in 2 bytes. And using 32 bit-per-char strings is ridiculously bulky. Dealing with utf8 is almost always more efficient: hiding multi-byte behind `wchar_t` doesn't *work*. – Yakk - Adam Nevraumont Jun 23 '15 at 13:36
  • @Jonny it appears you are handling surrogates in a half-correct manner. Character, as a concept, should probably include things like combining characters, because doing the equivalent of splitting the accent off the e is probably a bad idea. Welcome to the wonders of text processing! – Yakk - Adam Nevraumont Jun 23 '15 at 13:45
  • Here is a wikipedia article on combining characters: https://en.wikipedia.org/wiki/Combining_character -- splitting a combining character from the character beforehand is probably a bad idea. Another problem is left-to-right and right-to-left marks: https://en.wikipedia.org/wiki/Left-to-right_mark https://en.wikipedia.org/wiki/Right-to-left_mark -- and we are probably just getting started. – Yakk - Adam Nevraumont Jun 23 '15 at 13:53
  • See: https://en.wikipedia.org/wiki/Unicode_control_characters and https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29 for more things to consider. – Yakk - Adam Nevraumont Jun 23 '15 at 14:16
  • I think this function is using substr incorrectly. This function didn't work properly for me until I changed the return line to: `return str.substr(min, max-min);` – John Bowers Dec 29 '15 at 16:02

You could use the Boost.Locale library to convert the UTF-8 string into a `std::u32string`, and then use the normal `.substr()` approach:

#include <iostream>
#include <boost/locale.hpp>

std::string ucs4_to_utf8(std::u32string const& in)
{
    return boost::locale::conv::utf_to_utf<char>(in);
}

std::u32string utf8_to_ucs4(std::string const& in)
{
    return boost::locale::conv::utf_to_utf<char32_t>(in);
}

int main(){

  std::string utf8 = u8"1億2千万匹";

  std::u32string part = utf8_to_ucs4(utf8).substr(0,3);

  std::cout<<ucs4_to_utf8(part)<<std::endl;
  // prints : 1億2
  return 0;
}
  • `wstring` does not store single characters in single `wchar_t` in the general case. This only works in a restricted subset of unicode. Your function names are wrong: ucs4 does not fit in a 16-bit `wchar_t`. – Yakk - Adam Nevraumont Jun 23 '15 at 13:49
  • @Yakk You are right. I mixed it up with char32_t, which is always 32 bit - corresponding to a UCS-4 encoding. (I changed the code snippet accordingly.) – Gunnar Klämke Jun 23 '15 at 14:02
  • Still missing combining character support. You probably are dealing with left-to-right and right-to-left markers in unexpected ways https://en.wikipedia.org/wiki/Bi-directional_text#Unicode_bidi_support . [Combining Grapheme Joiner](https://en.wikipedia.org/wiki/Combining_Grapheme_Joiner), [Combining Character](https://en.wikipedia.org/wiki/Combining_character), the [BOM](https://en.wikipedia.org/wiki/Byte_order_mark), (deprecated language tags), [variation selectors](https://en.wikipedia.org/wiki/Variant_form_%28Unicode%29), etc. – Yakk - Adam Nevraumont Jun 23 '15 at 14:15
  • Yes, that may be true. But if the variable-length encoding of UTF-8 was the main problem in this question, the solution should work. – Gunnar Klämke Jun 23 '15 at 14:22
  • The question was about C++11 – Ident Nov 08 '15 at 12:44
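The combining-character caveat raised in the comments is easy to demonstrate: a substring that is correct by code points can still split an accent from its base letter. A small sketch ('é' spelled in decomposed form; the helper `first_code_point` is mine, for illustration only):

```cpp
#include <string>

// "é" in decomposed form: 'e' (U+0065) followed by U+0301 COMBINING ACUTE
// ACCENT (0xCC 0x81 in UTF-8) -- two code points, one visible character.
const std::string e_acute_decomposed = "e\xCC\x81"; // 3 bytes

// Taking the "first 1 code point" (here a single byte, since 'e' is ASCII)
// keeps the bare 'e' and silently drops the accent.
inline std::string first_code_point(const std::string& s)
{
    return s.substr(0, 1); // demo only: assumes an ASCII lead byte
}
```

So even a correct code-point substring may not be a correct *grapheme* substring; a library like ICU is the usual answer when that matters.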

Based on this answer I've written my utf8 substring function:

void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    size_t byteIndex = 0;
    const char* aStr = originalString.c_str();
    size_t origSize = originalString.size();

    for (byteIndex = 0; byteIndex < origSize; byteIndex++)
    {
        // Count only lead bytes; continuation bytes have the form 10xxxxxx.
        if ((aStr[byteIndex] & 0xc0) != 0x80)
            len += 1;

        // Stop at the lead byte of character number SubStrLength + 1,
        // so that exactly SubStrLength characters are kept.
        if (len > SubStrLength)
            break;
    }

    csSubstring = originalString.substr(0, byteIndex);
}
Atul

You could use the standard library to convert the UTF-8 string into a `std::u32string`, and then use the normal `.substr()` approach:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt> // note: deprecated since C++17, but fine for C++11/14

std::string ucs4ToUtf8(const std::u32string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}

std::u32string utf8ToUcs4(const std::string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(in);
}

int main(){

  std::string utf8 = u8"4ą5źćęł";

  std::u32string part = utf8ToUcs4(utf8).substr(0,3);

  std::cout<<ucs4ToUtf8(part)<<std::endl;
  // prints : 4ą5
  return 0;
}
Huberti