STL and UTF-8 file input/output. How to do it?

Question

I use wchar_t for internal strings and UTF-8 for storage in files. I need to use the STL to read and write text on the screen, and it has to handle the full Lithuanian character set.
That part is fine as long as I'm not forced to do the same for files; the following example does the job:

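#include <io.h>      // _setmode
#include <fcntl.h>   // _O_U16TEXT
#include <stdio.h>   // _fileno, stdout
#include <iostream>
using namespace std;
    // Microsoft-specific: switch stdout to UTF-16 so wcout can print non-ASCII text.
    _setmode (_fileno(stdout), _O_U16TEXT);
    wcout << L"AaĄąﬂ" << endl;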

But I became curious and tried to do the same with files, without success. Of course I could use C-style formatted input/output, but that is... discouraged.

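    FILE* fp;
    // Write: open in "w" mode, then switch the descriptor to UTF-8 translation (Microsoft-specific).
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    _fwprintf_p (fp, L"AaĄą\nﬂ");
    fclose (fp);
    // Read: fetch two whitespace-delimited words back as wide characters.
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wchar_t text[256];
    fseek (fp, 0, SEEK_SET);          // the offset is a long, not a pointer
    fwscanf (fp, L"%255ls", text);    // %ls stores wchar_t; the width guards the buffer
    wcout << text << endl;
    fwscanf (fp, L"%255ls", text);
    wcout << text << endl;
    fclose (fp);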

This snippet works perfectly (although I am not sure how it handles malformed chars). So, is there any way to:

get FILE* or integer file handle from a std::basic_*fstream?
simulate _setmode () on it?
extend std::basic_*fstream so it handles UTF-8 I/O?

Yes, I am studying at a university and this is somewhat related to my assignments, but I am trying to figure this out for myself. It won't influence my grade or anything like that.

Googling "stl utf8" yields http://www.codeproject.com/KB/stl/utf8facet.aspx about midway on the first page. — zneak, Oct 25 '10 at 20:10
@zneak Oh, my. I didn't expect a LMGTFY (let alone with my own keywords). Anyway, this is exactly what I was looking for, although I was hoping for something that would be possible to memorize. Silly me. — transistor09, Oct 26 '10 at 15:58

score 3 · Accepted Answer · edited May 23 '17 at 12:28

Use the std::codecvt facet template to perform the conversion.

You may use the standard std::codecvt_byname, or a non-standard codecvt_facet implementation.

#include <iostream>
#include <locale>
using namespace std;
typedef codecvt_byname<wchar_t, char, mbstate_t> Cvt;
int main() {
    // The constructor throws std::runtime_error if the named locale is not available.
    locale utf8locale(locale(), new Cvt("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}

Beware that on some platforms codecvt_byname can perform the conversion only for locales that are installed on the system.
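
As a hedged sketch of how the facet approach could be applied to files rather than to wcout: the example below swaps in std::codecvt_utf8 from <codecvt> (C++11, deprecated since C++17) in place of codecvt_byname, so it does not depend on a locale being installed. That substitution, the helper names utf8_locale, write_utf8_file and read_utf8_file, and the file name used in the note afterwards are assumptions made for illustration; they are not part of the answer above.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: a locale whose codecvt facet converts wchar_t <-> UTF-8.
// The locale takes ownership of the facet, so it is not deleted here.
std::locale utf8_locale() {
    return std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
}

// Hypothetical helper: write wide text to a file as UTF-8.
bool write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream out;
    out.imbue(utf8_locale());          // imbue before opening the file
    out.open(path, std::ios::binary);  // binary mode: no newline translation
    out << text;
    out.flush();
    return out.good();                 // false if opening or writing failed
}

// Hypothetical helper: read a whole UTF-8 file back into a wide string.
std::wstring read_utf8_file(const char* path) {
    std::wifstream in;
    in.imbue(utf8_locale());
    in.open(path, std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Under these assumptions, `assert(write_utf8_file("utf8_facet_test.txt", L"Aa\u0104\u0105"))` followed by `read_utf8_file("utf8_facet_test.txt")` should give back the same wide string, and the file on disk holds plain UTF-8 without a BOM.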

score 2 · Answer 2 · answered Oct 26 '10 at 19:55

Well, after some testing I figured out that a FILE* is accepted in place of an _iobuf* in the w*fstream constructor (this overload is a Microsoft extension). So, the following code does what I need.

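#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
using namespace std;
//For writing
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    wofstream fs (fp); //Microsoft-specific constructor taking a FILE*
    fs << L"ąﬂ";
    fs.flush (); //Push any buffered output to the FILE before closing it
    fclose (fp);
//And reading (a separate snippet, so fp and fs are declared again)
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wifstream fs (fp);
    wchar_t array[6];
    fs.getline (array, 5); //Reads at most 4 characters plus the terminator
    wcout << array << endl;//For debug
    fclose (fp);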

This sample reads and writes legit UTF-8 files (without BOM) on Windows when compiled with Visual Studio 2008.

Can someone give any comments about portability? Improvements?

Thank you very much :) This solved my mind-boggling question :) — Hossein, Jul 21 '13 at 10:42

score 1 · Answer 3 · edited May 23 '17 at 11:56

The easiest way would be to do the conversion to UTF-8 yourself before trying to output. You might get some inspiration from this question: UTF8 to/from wide char conversion in STL
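
For reference, a minimal sketch of what doing the conversion yourself might look like. It assumes every wchar_t holds one whole code point, which is true for BMP text such as Lithuanian on Windows and for any text where wchar_t is 32 bits wide; surrogate pairs are rejected rather than combined. The function names to_utf8 and from_utf8 are hypothetical, and the decoder throws on malformed input instead of guessing.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // WCHAR_MAX
#include <stdexcept>
#include <string>

// Encode wide text as UTF-8, treating each wchar_t as one code point.
std::string to_utf8(const std::wstring& wide) {
    std::string out;
    for (wchar_t wc : wide) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                throw std::invalid_argument("surrogate code units are not supported");
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}

// Decode UTF-8 into wide text, throwing std::invalid_argument on malformed input.
std::wstring from_utf8(const std::string& bytes) {
    static const unsigned long min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid lead byte");
        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values outside Unicode.
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        // Reject code points that do not fit in one wchar_t (16-bit wchar_t platforms).
        if (cp > static_cast<unsigned long>(WCHAR_MAX))
            throw std::invalid_argument("code point does not fit in wchar_t");
        out += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return out;
}
```

With these definitions, `assert(from_utf8(to_utf8(text)) == text)` should hold for any wide string that meets the assumption above, and the resulting std::string can be written to a plain byte stream such as std::ofstream.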

answered Oct 25 '10 at 20:12 by Mark Ransom

I know how to convert (I wrote a decoder), but I was wondering if there was a less troublesome way. – transistor09 Oct 25 '10 at 20:25
IMHO codecvt facet (once implemented) is a convenient way to perform conversions in STL. Why troublesome? – Basilevs Oct 26 '10 at 17:02
_Less troublesome_ would be if you could memorize it. Despite that, I agree that facet is convenient in most cases (until standards offer something internal). – transistor09 Oct 26 '10 at 18:33

score 0 · Answer 4 · edited May 23 '17 at 12:06

get FILE* or integer file handle from a std::basic_*fstream?

Answered elsewhere.

answered Oct 25 '10 at 20:13 by Mike DeSimone

Um, scratch that, FILE* fp; wofstream fs (fp); seems to work just fine! – transistor09 Oct 26 '10 at 18:59

score -3 · Answer 5 · answered Oct 25 '10 at 20:07

You can't make the STL work directly with UTF-8. The basic reason is that the STL indirectly forbids multi-char characters: each character has to be exactly one char or wchar_t.

Microsoft actually breaks the standard with their UTF-16 encoding, so maybe you can get some inspiration there.

answered Oct 25 '10 at 20:07 by Šimon Tóth

In the first snippet, if I omit _setmode, wcout doesn't work. I thought it would be the same with files too. And after executing the second snippet I tested it; the output file was legit UTF-8. – transistor09 Oct 25 '10 at 20:21
@transistor09 Yes, you can read and output UTF-8, but you can't store UTF-8 in any other way than as raw data or encoded in UTF-32 (UTF-16 on Windows). – Šimon Tóth Oct 25 '10 at 20:40
@Let_Me_Be That's what I want: keep wchar_t in RAM and store UTF-8 in HDD. – transistor09 Oct 25 '10 at 21:07
-1, simply untrue. `char` can have a multibyte encoding, and `std::string` doesn't change that. The Standard Library explicitly confirms it, in fact: there's no reason to have `std::codecvt::max_length` if it would always be 1 – MSalters Oct 26 '10 at 08:52
@MSalters That specifies the number of external characters matching one internal! Not the other way around. – Šimon Tóth Oct 26 '10 at 10:03
@transistor09 Check the UTF-8 facet mentioned in the comment to your question. That is the correct approach. – Šimon Tóth Oct 26 '10 at 10:04
@Let_Me_Be: The specialization `codecvt<char, char, mbstate_t>` is a no-op. In the specialization `codecvt<wchar_t, char, mbstate_t>`, the "internal" character is `wchar_t` and the "external" characters are chars, up to `char[codecvt::max_length]` – MSalters Oct 26 '10 at 10:58
@MSalters I have no idea why you are arguing with me. What I'm saying is that you can't have a UTF-8 or UTF-16 std::string (std::basic_string) and not break the standard. The external encoding can be anything. I'm talking about the internal encoding. – Šimon Tóth Oct 26 '10 at 11:07
@Let_Me_Be: Sorry, but that is just not true. I cannot prove the absence of such a rule, but if you think there's a rule in the standard you should be able to point to it. Should be in chapter 20 somewhere? – MSalters Oct 26 '10 at 11:16
@MSalters It is indirectly forbidden. For example, the definition of null terminated wide character sequence defines the length as the number of elements before the NULL character. If you have a variable length encoding, then this is not true. – Šimon Tóth Oct 26 '10 at 11:34
@Let_Me_Be: A "null terminated wide character sequence" is not an STL or a `std::basic_string` concept. It's a basic C concept, carried over to C++. Besides, UTF-8 would be used for Multi-Byte encodings (`char[]`), not wide characters (`wchar_t[]`). – MSalters Oct 27 '10 at 08:12
@MSalters `char` has even stricter limitations than `wchar_t`. – Šimon Tóth Oct 27 '10 at 09:00
@Let_Me_Be : no it doesn't. The standard explicitly acknowledges multi-byte character sequences. The library agrees, there's no point in having mcslen unless it differs from strlen. Furthermore, that still is the C legacy. You _still_ haven't indicated how the STL (`std::basic_string`) would forbid it. – MSalters Oct 27 '10 at 09:53
@MSalters Well, obviously yes you need `wcslen` in both cases, simply because it takes `const wchar_t*` and not `const char*`. – Šimon Tóth Oct 27 '10 at 10:32
@Let_Me_Be: Sorry, mixup there. `mcslen` is not standard. But I found an even clearer example: `MB_CUR_MAX` is the maximum number of chars in a single multi-byte character. Essentially, you state that the standard requires that `MB_CUR_MAX==1`. Why is it then even a #define ?! – MSalters Oct 27 '10 at 11:11
@MSalters `MB_CUR_MAX` should be number of bytes per `wchar_t` not number of `wchar_t` per platform-character. At least that is how I understand the description. – Šimon Tóth Oct 27 '10 at 11:17
A wide character is not a multibyte character, within the language of the standard. [defns.multibyte]: "a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment." [lib.locale.codecvt]: "The class codecvt is for use when converting from one codeset to another, such as from wide characters to multibyte characters." Also, the number of bytes per `wchar_t` is `sizeof wchar_t`. – Steve M Oct 27 '10 at 17:32
@Let_Me_Be: `MB_CUR_MAX` has nothing to do with `wchar_t`. – MSalters Oct 28 '10 at 11:27
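
To make the `std::string` side of this thread concrete, here is a small sketch, not taken from any of the comments, showing that a `std::string` can hold UTF-8 bytes without complaint: `size()` simply counts `char` elements, not characters. The helper name utf8_code_points is hypothetical, and it assumes the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping continuation bytes (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // every byte that is not a continuation starts a code point
    }
    return n;
}
```

For the four characters "AaĄą" stored as UTF-8 (`"Aa\xC4\x84\xC4\x85"`), `size()` reports 6 while `utf8_code_points` reports 4; the string class itself places no restriction on what the bytes encode.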
