57

I just want to write a few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simplest way to do so?

Raedwald
  • 46,613
  • 43
  • 151
  • 237
poiloi
  • 579
  • 1
  • 4
  • 3
  • 15
    It is insane that the std library is not able to deal with UTF-8. This is why we have to deal with tons of conversions between wide strings and byte strings with some awkward locale. Why isn't there, after all these years, anything like std::utf8string? – V-X Jan 16 '15 at 09:45
  • 6
    because C/C++ have to be compatible with non-existent hardware? :P – CoffeDeveloper May 13 '15 at 12:53

9 Answers

57

The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.

And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.

If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).

You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
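For illustration, a minimal sketch of what that can look like (the file name and the text are placeholders; the \x escapes are just one way of getting UTF-8 bytes into a narrow string literal, and the BOM line is optional):

#include <fstream>
#include <string>

int main()
{
    // A plain std::string holding UTF-8 encoded bytes ("héllo" spelled out as byte escapes).
    std::string text = "h\xC3\xA9llo";

    std::ofstream out("hello.txt", std::ios::binary);

    // Optional UTF-8 signature (BOM) so other programs can detect the encoding.
    out << "\xEF\xBB\xBF";

    // The bytes are written out unchanged; no conversion is involved.
    out << text << '\n';
}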

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • 2
    For completeness, add iterators to your first sentence, it's the same with them as with indexes. – sbi Jun 10 '10 at 06:41
  • 14
    A lot of programs choke on the BOM when they read UTF-8, and it will cause some programs to think the text is UTF-16. – Tim Seguine Sep 03 '13 at 17:35
  • 1
    @TimSeguine: That's just a long way of saying that a lot of programs have no or very poor support for UTF-8. – Ben Voigt Jun 13 '15 at 18:50
  • 2
    True, but it is a common, very specific way of having poor support that is worth knowing about should one encounter problems using it. – Tim Seguine Jun 18 '15 at 09:42
  • 6
    BOM codes tell you which of two possible byte orderings are employed by a utf16 or utf32 stream. They don't even make sense for a utf8 stream. – seattlecpp Jul 17 '15 at 04:33
  • 1
    Incorrect. While "byte order" isn't an issue for UTF-8, the byte order mark is still useful for distinguishing encodings. – Ben Voigt Jul 17 '15 at 05:43
  • 3
    Indeed the exact quote from [Unicode.org](http://www.unicode.org/faq/utf_bom.html) is: **Q:** Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian? **A:** Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order. *I take this to mean "indicates that it is UTF8 encoding"!* – SlySven Feb 08 '16 at 03:45
  • It's also possible for `std::string` to hold invalid utf8 codepoints. – jupp0r Oct 26 '16 at 12:24
  • @seattlecpp The UTF-8 "byte order mark" is used to distinguish encodings, not byte order; I've seen an alternate term used for the Unicode BOMs in general, the "Unicode signature", which I feel to be more appropriate, given that it indicates 1) which variant of Unicode is in use, and 2) the byte order (when applicable). – Justin Time - Reinstate Monica Apr 24 '17 at 17:14
  • The problem with the UTF-8 BOM/Unicode signature is that the Unicode Standard sends mixed messages about it. They don't require or recommend it, but they don't explicitly disrecommend it, either. They also don't recommend removing it if it's already there. This sends a "go with the flow" vibe, which has the result of UTF-8 BOM support being a mess; they should say either "always use it" or "never use it", but it's probably too late for that now (because either one would be a _major_ breaking change). – Justin Time - Reinstate Monica Apr 24 '17 at 17:19
  • You are assuming that the OP has a `std::string` that holds UTF-8. – Raedwald Dec 12 '17 at 23:06
  • @Raedwald: No, I'm informing him that `std::string` is capable of holding UTF-8. The question is asking for a design recommendation, not for help with fixing an existing design. – Ben Voigt Dec 12 '17 at 23:18
  • A general rule of thumb for including a UTF-8 BOM: Linux will assume UTF-8 by default so it doesn't need a BOM. Windows will assume a legacy code page by default, so you need the BOM to indicate that the file is UTF-8 instead. – Mark Ransom Dec 12 '17 at 23:25
24

There is a nice tiny library for working with UTF-8 from C++: utfcpp
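For example, converting codepoints to UTF-8 and writing them to a file could look roughly like this (a sketch assuming the utf8.h header and the utf8::utf32to8 helper from that library; check its documentation for the exact interface):

#include <fstream>
#include <iterator>
#include <string>
#include "utf8.h"   // from utfcpp

int main()
{
    std::u32string codepoints = U"h\u00E9llo";

    // Encode the UTF-32 codepoints as a UTF-8 byte string.
    std::string utf8_bytes;
    utf8::utf32to8(codepoints.begin(), codepoints.end(),
                   std::back_inserter(utf8_bytes));

    std::ofstream out("hello.txt", std::ios::binary);
    out << utf8_bytes;
}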

denys
  • 2,437
  • 6
  • 31
  • 55
10

libiconv is a great library for all our encoding and decoding needs.

If you are using Windows, you can use WideCharToMultiByte and specify that you want UTF-8.
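For example, a rough sketch of a Windows-only conversion from std::wstring (UTF-16) to a UTF-8 std::string; the helper name to_utf8 is just for illustration and error handling is omitted:

#include <windows.h>
#include <string>

// Convert a UTF-16 wide string to UTF-8 bytes using the Win32 API.
std::string to_utf8(const std::wstring& wide)
{
    // First call asks for the required buffer size in bytes.
    int size = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(size, '\0');
    // Second call performs the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                        &utf8[0], size, nullptr, nullptr);
    return utf8;
}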

Brian R. Bondy
  • 339,232
  • 124
  • 596
  • 636
10

What is the easiest and simple way to do so?

The most intuitive and thus easiest way of handling UTF-8 in C++ is surely a drop-in replacement for std::string. As the internet still lacks one, I went ahead and implemented the functionality on my own:

tinyutf8 (EDIT: now on GitHub).

This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over codepoints rather than chars). It is implemented to sit in the middle between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).

Hope this helps!

Jakob Riedle
  • 1,969
  • 1
  • 18
  • 21
  • Your library looks quite good but its license is very limiting. – Cem Kalyoncu Sep 03 '16 at 14:30
  • 1
    In what way is it limiting? What Licence do you want me to publish it under? – Jakob Riedle Sep 05 '16 at 05:34
  • 3
    GPL means, if I include your header in my program, I have to make my program GPL as well. Quite limiting don't you think? I would recommend BSD style license for a small library like this. – Cem Kalyoncu Sep 05 '16 at 06:53
  • Ok, I will change it to BSD-3 as soon as I find the time to. For now, I hereby grant you the use of tinyutf8 as specified by BSD-3, a.k.a. "New BSD License" :D Thanks for your feedback, I appreciate it! – Jakob Riedle Sep 05 '16 at 09:24
  • 1
    Personally, I would keep it GPL and provide an additional commercial (ask money for it) license for those who want to make money out of your work. – Adrian Maire Mar 01 '17 at 09:50
7

If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.

Tony the Pony
  • 40,327
  • 71
  • 187
  • 281
  • 1
    I'm guessing he has some other characters, though, that need encoding and that he is storing inside his string. But maybe not :) – Brian R. Bondy Jun 10 '10 at 01:39
5
#include <QString>
#include <QByteArray>
#include <string>

std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());             // UTF-8 encoded bytes
std::string str_std(byteArray.constData(), byteArray.length());
Danil
  • 701
  • 8
  • 7
0

Use Glib::ustring from glibmm.

It is the only widespread UTF-8 string container (AFAIK). While glyph (not byte) based, it has the same method signatures as std::string, so the port should be a simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).
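A rough sketch of what that can look like (assuming the glibmm/ustring.h header and that ustring::raw() exposes the underlying UTF-8 encoded std::string):

#include <glibmm/ustring.h>
#include <fstream>

int main()
{
    Glib::ustring text("h\xC3\xA9llo");   // constructed from UTF-8 bytes

    std::ofstream out("hello.txt", std::ios::binary);
    out << text.raw();                    // write the raw UTF-8 bytes unchanged
}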

  • Why was this downvoted? By virtue of being used in `glibmm`, `gtkmm`, and all dependent projects (including InkScape), this is a widely used and thus fairly battle-tested UTF8-string class. Why is that not worth a mention? – underscore_d Apr 07 '22 at 08:47
0

My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to UTF-8 when writing out to a file, using these converting iterators I put on GitHub.

#include <utf/utf.h>

int main()
{
    using namespace utf;

    u32string u32_text = U"ɦΈ˪˪ʘ";
    // do stuff with string
    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());

    u8string u8_text(pos, end);

    // write out utf8 to file.
    // ...
}
rmawatson
  • 1,909
  • 12
  • 20
-28

UTF-8 is a multibyte character string format, so you get some problems working with it, and it's a bad idea. Instead use normal Unicode.

So in my opinion it is best to use ordinary ASCII char text with some code page. You only need Unicode if you use more than 2 sets of different symbols (languages) in a single text.

That is a rather rare case. In most cases 2 sets of symbols are enough. For this common case use ASCII chars, not Unicode.

The only effect of using multibyte chars like UTF-8 is for traditional Chinese, Arabic or some hieroglyphic text. It's a very very rare case!!!

I don't think there are many people who need that. So never use UTF-8!!! It avoids the strong headache of manipulating such strings.

maazza
  • 7,016
  • 15
  • 63
  • 96
Anatoly
  • 23
  • 1
  • 5
    What exactly do you mean by "normal Unicode"? I am going to assume you mean what most Java and Windows programmers think Unicode means: UTF16. This is also not a constant width encoding (not every character takes exactly 2 bytes). Approximately half of Internet users are from China. Very rare! – Tim Seguine Sep 03 '13 at 16:38
  • 2
    @Anatoly - some background reading: http://www.joelonsoftware.com/articles/Unicode.html, http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/, http://www.utf8everywhere.org/. If you only read one, read the first of these. You may change your recommendation to never use UTF-8! – Matt Wallis Oct 25 '13 at 12:40
  • 2
    The reason to use UTF-8 is that it can encode all Unicode code points and that it is memory efficient for Latin languages. The drawback indeed is that you have a variable-length encoding. Note that there is a difference between UTF-16 and UCS-2. UCS-2 is the one you mention: a fixed 2 bytes per character, but with the drawback that it cannot encode all code points. – gast128 Dec 04 '14 at 11:36