I am trying to iterate through a UTF-8 string. As I understand it, the problem is that UTF-8 characters have variable length, so I can't just iterate char by char and need some kind of conversion. I am sure there is a function for this in modern C++, but I don't know what it is.

#include <iostream>
#include <string>

int main()
{
  std::string text = u8"řabcdě";
  std::cout << text << std::endl; // Prints fine
  std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl; // Again fine. So 'ř' is a 2 byte letter?

  for(auto it = text.begin(); it < text.end(); it++)
  {
    // Obviously wrong. Only the ASCII part of the text (a, b, c, d) is output correctly
    std::cout << "Iterating: " << *it << std::endl; 
  }
}

Compiled with clang++ -std=c++11 -stdlib=libc++ test.cpp

From what I've read, wchar_t and wstring should not be used.

Jan Šimek
  • There is no such thing as "UTF-8 characters". Until you're familiar with the subject matter, it will be frustrating and unrewarding to jump into writing code. – Kerrek SB Sep 27 '14 at 11:21
  • Are you on some Unixoid or on windows? And do you want codeunits, codepoints or graphemes? (Character is ludicrously context-dependent (and even the context might not be enough to decide), and there's extra hurt in store on windows) – Deduplicator Sep 27 '14 at 11:21
  • You may want to take a look [here](http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes). Bear in mind it doesn't work in gcc, since they have not implemented this part of the standard yet, but it works in clang/libc++ and should work with VS2013 IIRC. – n. m. could be an AI Sep 27 '14 at 11:38
  • @Deduplicator OS X, but I am looking for a cross-platform solution. Graphemes - I simply want to divide the string into separate letters. – Jan Šimek Sep 27 '14 at 12:23
  • possible duplicate of [Cross-platform iteration of Unicode string (counting Graphemes using ICU)](http://stackoverflow.com/questions/4579215/cross-platform-iteration-of-unicode-string-counting-graphemes-using-icu) – Deduplicator Sep 27 '14 at 13:00
  • @n.m. Thank you, that works and it is exactly what I've been looking for (although it is a shame that gcc doesn't support it yet). You can submit that as an answer. – Jan Šimek Sep 27 '14 at 14:44
  • Re graphemes: u8"řabcdě" is textually equivalent to u8"r\u030Cabcde\u030C" and u8"řabcde\u030C" and u8"r\u030Cabcdě"; they all have the same 6 letters in the same sequence (in the same case, too). – Tom Blodget Sep 27 '14 at 18:32

1 Answer

As n.m. suggested, I used std::wstring_convert:

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

int main()
{
  std::u32string input = U"řabcdě";

  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;

  for(char32_t c : input)
  {
    std::cout << converter.to_bytes(c) << std::endl;
  }
}

Perhaps I should've specified more clearly in the question that I wanted to know whether this is possible in C++11 without any third-party libraries like ICU or UTF8-CPP.

Jan Šimek
  • What version of g++ did you use? It might be part of C++14. – Splash Nov 09 '15 at 03:24
  • I use clang: Apple LLVM version 7.0.0 (clang-700.0.72), but this is all C++11. You can check at http://en.cppreference.com – Jan Šimek Nov 09 '15 at 06:19
  • I was running at http://en.cppreference.com/w/cpp/locale/codecvt_utf8, and chose the 4.9 version C++11, and it doesn't compile. Could you take a look? – Splash Nov 09 '15 at 17:27
  • The example was missing #include . I just tested it with GCC 5.2.0 on linux. – Jan Šimek Dec 07 '15 at 12:04