48

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

Mr.C64
Abdelwahed
  • By "Unicode" do you mean UTF-8 or UTF-16? And what platform are you using? – dan04 Jan 23 '11 at 18:07
  • 3
    Read this article : [Reading UTF-8 with C++ streams](http://www.codeproject.com/KB/stl/utf8facet.aspx) – Nawaz Jan 23 '11 at 18:25
  • 5
    Another good article : [UTF-8 with C++ in a Portable Way](http://utfcpp.sourceforge.net/) – Nawaz Jan 23 '11 at 18:27
  • 4
    On windows, you should use std::string for UTF-8 and std::wstring for UTF-16. – anno Jan 23 '11 at 19:28

7 Answers

44

With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string. It can be used to read and write UTF-8 files, both text and binary.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:

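#include <sstream>
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension, not standard C++.
    // The locale takes ownership of the facet and deletes it.
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();  // read the whole file through the converting buffer
    return wss.str();
}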

which can be used like this:

std::wstring wstr = readFile("a.txt");

Alternatively, you can set the global C++ locale before you work with streams. All later calls to the std::locale default constructor then return a copy of the global C++ locale, so you don't need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
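Because std::locale::empty() is not part of standard C++, a portable variant can build the locale from a copy of the global locale instead. The following is only a minimal sketch under that assumption; the function name readUtf8File is made up for illustration. Note that where wchar_t is 16 bits wide (as on Windows), std::codecvt_utf8<wchar_t> yields UCS-2, so std::codecvt_utf8_utf16<wchar_t> would be the facet to try if characters outside the BMP matter. The <codecvt> facets are also deprecated in newer standards, so compilers may warn.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a UTF-8 file into a std::wstring using only
// standard C++ (no std::locale::empty()).
std::wstring readUtf8File(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale() copies the current global locale; the new locale owns
    // the facet and deletes it when the last copy is destroyed.
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();  // empty result if the file is missing or empty
    return wss.str();
}
```

It is called the same way as readFile above, for example readUtf8File("a.txt").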
LihO
  • 2
    Does that `new codecvt_utf8` require a corresponding `delete`? – Dmitri Nesteruk Sep 05 '16 at 06:45
  • 1
    No need to explicitly delete codecvt_utf8. This is done in the destructor of std::locale when the refcounter of codecvt_utf8 becomes zero (see http://en.cppreference.com/w/cpp/locale/locale/%7Elocale) – MrTux Oct 14 '16 at 16:00
  • 2
    For those using this answer, std::locale::empty() has a problem on clang: error: no member named 'empty' in 'std::__1::locale'. – Felipe Valdes Mar 21 '19 at 22:55
  • 2
    Sadly, all of the useful parts of codecvt have been deprecated in C++20. – Bob Kline Nov 19 '20 at 14:10
14

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8.

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (locale::global() might work here as well) and the wchar_t* overload of the basic_ifstream constructor, both of which are non-standard, this should be fairly standard-compliant (where “standard” means C++0x, of course).

Philipp
  • 5
    Why don't you `delete converter`? – Mikhail Sep 28 '13 at 19:34
  • 1
    "Overload 7 is typically called with its second argument, f, obtained directly from a new-expression: the locale is responsible for calling the matching delete from its own destructor." [link](http://en.cppreference.com/w/cpp/locale/locale/locale) – sven Jul 29 '15 at 18:17
  • This works well. Curious, as I can't find a lot of info on it, and mine works fine without it, what is stream.imbue doing exactly? It seems as though it is setting some type of default type, but is this needed? Also, for first line remark, put your getline in a while(getline(stream, line)) loop to see more than the first line. – adprocas Sep 25 '16 at 03:55
12

Here's a platform-specific function for Windows only:

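#include <cstdio>
#include <string>
#include <sys/stat.h>

// Returns the size of the file in bytes, or 0 if it cannot be queried.
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return fileinfo.st_size;
}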

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

    // Read entire file contents into memory. The size in bytes is an upper
    // bound on the number of wide characters the file decodes to.
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use it for very large files.
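The function sizes its buffer from the file size in bytes, while fread counts wide characters. That is safe because a well-formed UTF-8 sequence never needs more UTF-16 code units than it has bytes: sequences of one to three bytes become one unit, and four-byte sequences become two. A small portable sketch of that count follows; the helper name utf16UnitsForUtf8 is hypothetical and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the UTF-16 code units needed to hold
// well-formed UTF-8 input. Malformed input is not detected.
std::size_t utf16UnitsForUtf8(const std::string& utf8)
{
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80)
            continue;                   // continuation byte, already counted
        units += (b >= 0xF0) ? 2 : 1;   // 4-byte lead needs a surrogate pair
    }
    return units;
}
```

Since the result never exceeds utf8.size(), resizing the buffer to the byte count and shrinking it afterwards, as the answer does, cannot overflow.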

AshleysBrain
  • 3
    Might as well go the whole way: _wfopen(filename.c_str(), L"rt, ccs=UTF-8"); Conversion is now automatic. – Hans Passant Jan 23 '11 at 18:46
  • Actually, rolled it back, docs on the _wfopen say it converts to wide characters automatically, and this code doesn't take that in to account. – AshleysBrain Jan 23 '11 at 19:04
  • Only the filename. Quote: `Simply using _wfopen has no effect on the coded character set used in the file stream. ` – Hans Passant Jan 23 '11 at 20:04
  • Are you sure? The way I interpreted the docs, specifying `t` in the mode as well as `ccs=UTF-8` causes characters to be converted as they are read to and from the stream. – AshleysBrain Jan 23 '11 at 20:33
  • @Ashley: Yes, the quote refers to using `_wfopen` *without* the `ccs=` mode specifier. You need both `_wfopen` (according to the manual `_wfopen_s` is to be preferred) *and* `ccs=UTF-8`. – Philipp Jan 23 '11 at 20:42
  • Late edit in August: turns out @Hans Passant's way is better - edited the answer to use that instead! – AshleysBrain Aug 11 '11 at 15:42
4
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

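int main()
{
    std::wifstream wif("filename.txt");
    // Any installed UTF-8 locale can be named here; std::locale throws
    // std::runtime_error if the named locale is not available.
    wif.imbue(std::locale("zh_CN.UTF-8"));

    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}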
Shen Yu
1

I recently dealt with all of these encodings and solved it this way. It is better to use std::u32string because its character size is the same on all platforms, and most fonts work with the UTF-32 format. (The file should still be encoded in UTF-8.)

std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    std::getline(fin, str, U'\0');
    return str;
}

You are free to use the standard stream functions other than gcount, but store the result of tellg only in a pos_type. Also be sure to pass a delimiter to std::getline; if you don't, the function throws std::bad_cast.
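A char32_t stream depends on the standard library supplying the matching locale facets, which may not hold on every implementation (the std::bad_cast mentioned above is one symptom). An alternative is to read the file as plain bytes and decode the UTF-8 by hand. The following is a minimal sketch, not a full validator: the names decodeUtf8 and readUtf8FileAsUtf32 are hypothetical, malformed sequences are replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06)   { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E)   { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E)   { cp = lead & 0x07; extra = 3; }
        else { out.push_back(U'\uFFFD'); continue; }  // invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }  // truncated input
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not a continuation
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : U'\uFFFD');
    }
    return out;
}

// Hypothetical helper: loads the whole file as bytes, then decodes it.
std::u32string readUtf8FileAsUtf32(const std::string& filename)
{
    std::ifstream in(filename, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decodeUtf8(bytes);
}
```

Because only byte streams are involved, this approach does not rely on any locale or codecvt facet.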

Hedgeberry
0

This question was addressed in "Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI". In short, wstring is based upon the UCS-2 standard, the predecessor of UTF-16, which is strictly a two-byte encoding. I believe this covers Arabic.
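The comments on this answer dispute the UCS-2 claim: a sequence of 16-bit units can carry UTF-16, in which a code point above U+FFFF occupies two units (a surrogate pair). A minimal sketch of that encoding step follows. It uses char16_t so that it behaves the same on every platform (on Windows, wchar_t has the same 16-bit width); the function name encodeUtf16 is hypothetical, and the input is assumed to be a valid code point.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: one code unit
    } else {
        cp -= 0x10000;                             // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, a code point in the Brāhmī block such as U+11005 comes out as the two units 0xD804 and 0xDC05, which a 16-bit string type can store like any other elements.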

Community
ThomasMcLeod
  • 1
    I think you can use wstring with UTF-16 – David Heffernan Jan 23 '11 at 19:02
  • @David: Actually you are incorrect, and this is a common misunderstanding. UTF-16 covers 1,112,064 code points from 0 to 0x10FFFF. The scheme requires variable-length storage of either one or two 16-bit words, whereas UCS-2 was strictly one 16-bit word. If you trace back the definition of wchar_t, you will find that it has at its root a primitive type of 16 bits (usually a short). – ThomasMcLeod Jan 23 '11 at 19:59
  • 1
    @David: Technically, a `wstring` is just an array of 16-bit integers on Windows. You can store UCS-2 or UTF-16 data or whatever you like in it. Most Windows APIs do accept UTF-16 strings nowadays. – Philipp Jan 23 '11 at 20:08
  • @Philip I thought all Windows APIs are UTF-16 now. Which ones take UCS-2? – David Heffernan Jan 23 '11 at 20:10
  • @Thomas I'm afraid the misunderstanding is on you. I know about variable length of UTF-16 and surrogate pairs. But that is perfectly compatible with wstring. A surrogate pair takes 2 wchar_t elements. – David Heffernan Jan 23 '11 at 20:13
  • @Philipp: you can store a subset of UTF-16 characters in a wstring. For example, you cannot store the Balinese script characters in a wstring, but there are valid UTF-16 encodings for these characters. http://en.wikipedia.org/wiki/Balinese_script – ThomasMcLeod Jan 23 '11 at 20:15
  • @Thomas that's not correct. UTF-16 uses 16 bit code units, i.e. a wchar_t on Windows. – David Heffernan Jan 23 '11 at 20:18
  • @Thomas I have to agree with David. You can store any Unicode code point in a `wstring` if you treat it as an UTF-16 string. Non-BMP code points will need two code units, but there's nothing wrong with that. – Philipp Jan 23 '11 at 20:22
  • @Philipp: scratch my previous. I meant to refer to the Brāhmī script, which is even more obscure – ThomasMcLeod Jan 23 '11 at 20:23
  • @David: I think (but I'm not sure, I'm not using Windows right now) that the console still doesn't handle non-BMP characters. It is debatable whether that has something to do with the API itself. – Philipp Jan 23 '11 at 20:23
  • 1
    @Thomas anything with a defined Unicode code point can be represented in UTF-16 – David Heffernan Jan 23 '11 at 20:24
  • 1
    @Philipp the console is a whole world of pain! Even getting it to display non ANSI code points is an exercise of extreme masochism! – David Heffernan Jan 23 '11 at 20:25
  • @David: No, it's two lines, see http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx – Philipp Jan 23 '11 at 20:33
  • 1
    @Philipp Very interesting! I'm used to Python on Windows which has rubbish console support. – David Heffernan Jan 23 '11 at 20:36
  • @David: We seem to be arguing about semantics. You said "I think you can **use** wstring with UTF-16." That means more than store. It means store and have it interpreted correctly by at least stdio. I just tried using SMP characters with wcout and a wstring on Windows 7 Pro 64-bit, and got a whole lot of gibberish. – ThomasMcLeod Jan 23 '11 at 20:38
  • @Thomas That doesn't mean the problem is with `wstring`. – David Heffernan Jan 23 '11 at 20:40
  • 1
    @David I think that's a Python problem, not a Windows problem. I know the Python devs try hard to get Unicode support everywhere, but I think it's hard to bring the actual Windows semantics to a model that assumes that operating system streams are always byte-based and encoding-agnostic (that is true for Unix file and console streams and for Windows file streams, but not for the Windows console). I haven't studied the Python source code, but I think that at least some time in the past they assumed this model to hold. – Philipp Jan 23 '11 at 20:48
  • @Philipp It's just a real shame that the Windows console feels a little neglected. – David Heffernan Jan 23 '11 at 20:49
  • 1
    @Thomas: I don't think the MSVC++ `iostreams` library does any kind of Unicode except allowing Unicode file names. All solutions for using Unicode in C++ are effectively pure C solutions, either using the Windows API directly or using nonstandard extensions to the C library. – Philipp Jan 23 '11 at 20:50
  • @Philipp, I agree. That's why I say that wstring is UCS-2 and not UTF-16. – ThomasMcLeod Jan 23 '11 at 20:53
  • @David: the problem is not with wstring storage, it's with typical wstring usage and UTF-16. You can store UTF-16 in a bitset if you want, but is that using it with UTF-16? Not really. – ThomasMcLeod Jan 23 '11 at 20:59
  • 1
    @thomas what would you use instead of wstring? – David Heffernan Jan 23 '11 at 21:09
  • @Thomas: The MSVC++ standard library doesn't support UCS-2 either. Last time I checked, the C++ locales didn't support any Unicode locale, making Unicode output essentially impossible. – Philipp Jan 23 '11 at 21:20
  • Correction: The MSVC++ library does [support](http://msdn.microsoft.com/en-us/library/0he30td8.aspx) UTF-16 and UTF-32 for the types `char16_t` and `char32_t`, that would essentially solve the issue for file I/O. – Philipp Jan 23 '11 at 22:25
  • @David: There's no good answer. What to use I guess depends on framework, platform, specific I/O requirements, etc. In general, if one must support non-BMP, char32_t and UTF-32 seems safer. – ThomasMcLeod Jan 23 '11 at 22:50
  • @Thomas No the question is what you use instead of wstring for UTF-16 – David Heffernan Jan 23 '11 at 22:53
  • @David, convert it to UTF-32, then use string. Or, in .Net use system.text.UTF32Encoding – ThomasMcLeod Jan 23 '11 at 23:04
  • @David, unless, of course, you can guarantee BMP, then there's no issue. – ThomasMcLeod Jan 23 '11 at 23:06
  • @thomas have you heard of surrogate pairs? UTF-16 is designed to be used with 16-bit code units. Outside BMP is fine. Are you aware that UTF-16 can encode all Unicode code points? – David Heffernan Jan 23 '11 at 23:11
  • @David, yes I'm aware. The problem is that many APIs that use wstrings don't know the difference. They interpret surrogate pairs as two 16-bit code points. But since the surrogate pairs are in the invalid range of the BMP, they are ignored. – ThomasMcLeod Jan 23 '11 at 23:21
  • 1
    @thomas that would be a criticism of the API but your original point is that wstring is no good for storing UTF-16. Anyway which APIs are you referring to. I'm curious to know which ones don't support Unicode. – David Heffernan Jan 23 '11 at 23:26
-6

This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*?

Something like:

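#include <fstream>
#include <string>

// Note: this copies raw bytes without any conversion, so the result is only
// meaningful if the file already holds text in wchar_t's native encoding
// (UTF-16LE on Windows). UTF-8 input is not decoded.
std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    // Opening a stream from a wide-character path is an MSVC extension.
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return std::wstring();
    size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    // Read straight into the string's storage with an explicit length;
    // the raw bytes carry no terminating null character.
    std::wstring wstr(size / sizeof(wchar_t), L'\0');
    file.read((char*)&wstr[0], wstr.size() * sizeof(wchar_t));
    return wstr;
}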
dlchambers