To split a std::string into characters I can just iterate over the string. However, this doesn't work if the string contains German umlauts (ä, ö, ü, ß, ...).
I found a solution using std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> that works for me, but it feels too complicated. Is there a nicer solution?
#include <string>
#include <vector>
#include <iostream>
#include <locale>
#include <codecvt>
// Works with umlauts:
std::vector<std::string> split_wstring(const std::string &word) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::wstring wword = converter.from_bytes(word);
    std::vector<std::string> characters;
    for (auto iter : wword) {
        characters.push_back(converter.to_bytes(iter));
    }
    return characters;
}
// Works fine for English words but fails for umlauts:
std::vector<std::string> split_string(const std::string &word) {
    std::vector<std::string> characters;
    for (auto iter : word) {
        // Build a one-byte string; passing &iter would read a char* that is not null-terminated.
        characters.push_back(std::string(1, iter));
    }
    return characters;
}
int main() {
    for (auto c : split_string("AbcühßtÖ")) {
        std::cout << "Split String: " << c << std::endl;
    }
    for (auto c : split_wstring("AbcühßtÖ")) {
        std::cout << "Split W-String: " << c << std::endl;
    }
}
(I split the words into single-character std::strings instead of chars because I need them to be std::strings anyway.)
Output is:
Split String: A
Split String: b
Split String: c
Split String: �
Split String: �
Split String: h
Split String: �
Split String: �
Split String: t
Split String: �
Split String: �
Split W-String: A
Split W-String: b
Split W-String: c
Split W-String: ü
Split W-String: h
Split W-String: ß
Split W-String: t
Split W-String: Ö
There is a similar post: "C++ iterate utf-8 string with mixed length of characters". The solution there is to use lengthy third-party code. I think my solution with the wstring converter is already nicer.
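
One alternative that might avoid <codecvt> entirely is to walk the bytes and use the UTF-8 lead byte to decide how many bytes belong to each character. The following is only a minimal sketch: the names split_utf8 and utf8_sequence_length are made up for illustration, it assumes the std::string holds valid UTF-8, and it does not validate continuation bytes. It splits by code point, so a letter followed by a combining mark would still come out as two pieces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence starting with `lead`.
// Returns 1 for ASCII and also for stray or invalid bytes, so the caller always advances.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence (e.g. umlauts)
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
    return 1;                             // continuation byte or invalid lead byte
}

// Hypothetical helper: split a UTF-8 encoded string into one std::string per code point.
// Assumes valid UTF-8 input; continuation bytes are not checked.
inline std::vector<std::string> split_utf8(const std::string &word) {
    std::vector<std::string> characters;
    for (std::size_t i = 0; i < word.size();) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(word[i]));
        // substr clamps the count to the remaining length, so a truncated
        // trailing sequence does not read out of bounds.
        characters.push_back(word.substr(i, len));
        i += len;
    }
    return characters;
}
```

Under these assumptions, calling split_utf8("AbcühßtÖ") from a UTF-8 encoded source file should give the same eight pieces as the wstring version, without a round trip through std::wstring. That may be worth considering because std::wstring_convert and the <codecvt> facets are deprecated as of C++17.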