
To split a std::string into characters, I can simply iterate over the string. However, this doesn't work if the string contains German umlauts and similar characters (ä, ö, ü, ß, ...).

I found a solution using std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> that works for me, but it feels too complicated. Is there a nicer solution?

#include <string>
#include <vector>
#include <iostream>
#include <locale>
#include <codecvt>

// Works with umlauts:
std::vector<std::string> split_wstring(const std::string &word) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::wstring wword = converter.from_bytes(word);
    std::vector<std::string> characters;
    for (auto iter : wword) {
        characters.push_back(converter.to_bytes(iter));
    }
    return characters;
}

// Works fine for english words but fails for umlauts:
std::vector<std::string> split_string(const std::string &word) {
    std::vector<std::string> characters;
    for (auto iter : word) {
        characters.push_back(&iter);
    }
    return characters;
}

int main() {
    for (auto c : split_string("AbcühßtÖ")) {
        std::cout << "Split String: " << c << std::endl;
    }
    for (auto c : split_wstring("AbcühßtÖ")) {
        std::cout << "Split W-String: " << c << std::endl;
    }
}

(I split the words into single-character std::strings instead of chars because I need them as std::strings anyway.)

Output is:

Split String: A
Split String: b
Split String: c
Split String: �
Split String: �
Split String: h
Split String: �
Split String: �
Split String: t
Split String: �
Split String: �
Split W-String: A
Split W-String: b
Split W-String: c
Split W-String: ü
Split W-String: h
Split W-String: ß
Split W-String: t
Split W-String: Ö

There is a similar post, "C++ iterate utf-8 string with mixed length of characters", but the solution there relies on lengthy third-party code. I think my solution with the wstring converter is already nicer.

Johannes
  • Well, you need more than ASCII characters for these characters to be handled properly, so using Unicode/wstring may be the only option? Or are you using a 1-byte encoding? – Matthieu Brucher Nov 28 '18 at 14:48
  • Possible duplicate of [C++ iterate utf-8 string with mixed length of characters](https://stackoverflow.com/questions/40054732/c-iterate-utf-8-string-with-mixed-length-of-characters) – The Quantum Physicist Nov 28 '18 at 14:54
  • Note: `for (auto iter : word) { characters.push_back(&iter); }` does not do what you think it does. `word` is a `std::string`, so `iter` is a single `char`, and `&iter` is a `char*` without a null-terminator. To push individual characters into the `vector`, you would have to remove the `&`, but `std::string` does not have a constructor that takes a single `char` as its sole input, so you would have to use `characters.push_back(std::string(1, iter));` or `characters.push_back(std::string(&iter, 1));`. Or, you can use `std::vector characters(word.begin(), word.end());` instead. – Remy Lebeau Nov 29 '18 at 02:17
  • Also, "*Works fine for english words but fails for umlauts*" depends on the particular encoding of the `std::string`. If the string is encoded as UTF-8, then certainly yes, it will not work, since the umlauts (any non-ASCII character in general) will require more than 1 `char` to encode it. But, if the string is encoded in a single-byte encoding like Windows-1252 or ISO-8859-1, then the umlauts will work fine. – Remy Lebeau Nov 29 '18 at 02:21
  • Also, "*my solution with the wstring converter*" will not work for Unicode characters that require more than 1 `wchar_t` to encode them in UTF-16 (codepoints outside the BMP, i.e. U+10000 and higher). – Remy Lebeau Nov 29 '18 at 02:23
  • You might consider changing your functions to accept an array of `char*`/`wchar_t*` pointers (or an array of `std::(w)string` strings) as input, then you will be able to split the input string using multi-character delimiters. – Remy Lebeau Nov 29 '18 at 02:27
  • The proposed solution in [C++ iterate utf-8 string with mixed length of characters](https://stackoverflow.com/questions/40054732/c-iterate-utf-8-string-with-mixed-length-of-characters) doesn't need external libraries. – Barmak Shemirani Nov 29 '18 at 02:49
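
To make the comments above concrete, here is a minimal sketch of the byte-wise split with the `std::string(1, c)` fix that Remy Lebeau describes. The function name `split_bytes` is hypothetical and not part of the question. It assumes the input is UTF-8, so it still yields one entry per byte (code unit), not one per visible character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits a string into one std::string per char.
// Each element holds a single byte (UTF-8 code unit), so a multi-byte
// character such as "ü" ends up spread over several elements.
std::vector<std::string> split_bytes(const std::string& word) {
    std::vector<std::string> units;
    units.reserve(word.size());
    for (char c : word) {
        // (count, char) constructor: avoids taking the address of a lone char.
        units.push_back(std::string(1, c));
    }
    assert(units.size() == word.size());  // exactly one entry per byte
    return units;
}
```

With a UTF-8 input, `split_bytes("Abc")` yields three entries, while the two-byte sequence for "ü" (`"\xC3\xBC"`) yields two entries that are not valid UTF-8 on their own. That would explain the replacement characters in the output shown in the question.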

1 Answer


Thanks for all the replies; they helped me understand that converting to UTF-16 or UTF-32 is not the best approach.

I took another look at this answer and wrote an iterator based on it. I confirmed that it works for UTF-8 strings containing characters of different byte lengths.

#include <string>
#include <vector>
#include <iostream>


class UtfIterator {
public:
    std::string::const_iterator str_iter;
    size_t cplen;

    UtfIterator(const std::string::const_iterator str_iter) : str_iter(str_iter) {
        find_cplen();
    }

    std::string operator*() const {
        return std::string(str_iter, str_iter + cplen);
    }

    UtfIterator& operator++() {
        str_iter += cplen;
        find_cplen();
        return *this;
    }

    bool operator!=(const UtfIterator &o) const {
        return this->str_iter != o.str_iter;
    }
private:
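    // Derive the byte length of the current code point from its UTF-8 lead byte.
    // Caveat: this also dereferences str_iter when it equals end().
    void find_cplen() {
        cplen = 1;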
        if((*str_iter & 0xf8) == 0xf0) cplen = 4;
        else if((*str_iter & 0xf0) == 0xe0) cplen = 3;
        else if((*str_iter & 0xe0) == 0xc0) cplen = 2;
        // if(iter + cplen > text.length()) cplen = 1;
    }
};

int main() {
    std::string s("今天周五123äöÜß");
    for (UtfIterator iter(s.begin()); iter != UtfIterator(s.end()); ++iter) {
        std::cout << "char: " << *iter << std::endl;
    }
}

About the commented-out line: as far as I understand, its purpose is to detect broken UTF-8 strings that are missing bytes at the end. I could not find a way to implement this in my iterator without knowing the end() iterator. Any ideas?
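
One possible way to handle this, sketched under the assumption that passing the end position to the iterator is acceptable: store `end()` next to the current position, never dereference it, and fall back to a length of 1 when fewer bytes remain than the lead byte announces. The names `Utf8CodePointIterator` and `split_utf8` are made up for this sketch. It only inspects lead bytes and does not validate continuation bytes, so it is not a full UTF-8 validator. It compiles as C++14.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bounded variant of the iterator above.
class Utf8CodePointIterator {
public:
    using Iter = std::string::const_iterator;

    Utf8CodePointIterator(Iter pos, Iter end) : pos_(pos), end_(end) {
        find_cplen();
    }

    std::string operator*() const {
        assert(pos_ != end_);  // dereferencing the end position is a bug
        return std::string(pos_, pos_ + cplen_);
    }

    Utf8CodePointIterator& operator++() {
        pos_ += cplen_;
        find_cplen();
        return *this;
    }

    bool operator!=(const Utf8CodePointIterator& o) const {
        return pos_ != o.pos_;
    }

private:
    void find_cplen() {
        cplen_ = 0;
        if (pos_ == end_) return;  // nothing left: do not read past the end
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        cplen_ = 1;
        if ((lead & 0xf8) == 0xf0) cplen_ = 4;
        else if ((lead & 0xf0) == 0xe0) cplen_ = 3;
        else if ((lead & 0xe0) == 0xc0) cplen_ = 2;
        // Truncated sequence at the end of the string: emit a single byte.
        if (end_ - pos_ < cplen_) cplen_ = 1;
    }

    Iter pos_;
    Iter end_;
    std::ptrdiff_t cplen_ = 0;
};

// Hypothetical convenience wrapper: collects all code points of s.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> out;
    const Utf8CodePointIterator last(s.end(), s.end());
    for (Utf8CodePointIterator it(s.begin(), s.end()); it != last; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

The end iterator is then constructed as `Utf8CodePointIterator(s.end(), s.end())`. For a string that ends in a lone lead byte, such as `"A\xC3"`, the last element is that single byte instead of a read past the end of the string.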

Johannes
  • I suggest your iterator return `std::string_view` instead. Yes, SSO will avoid allocations anyway, but it's still a bit more expensive. Also, there are lots of possible Unicode-Iterators: Code-Unit (Maybe converting the source), Code-Point, grapheme, grapheme-cluster and more, so go for a more descriptive name. – Deduplicator Nov 29 '18 at 12:03
  • I'm on C++14, so I cannot use `std::string_view`, but thanks for the hint. – Johannes Nov 29 '18 at 14:30