
Consider the following code:

#include <iostream>
#include <boost/locale.hpp>
#include <Windows.h>
#include <fstream>

std::string ToUtf8(std::wstring str)
{
    std::string ret;
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}

int main()
{
    std::wstring wfilename = L"D://Private//Test//एउटा फोल्दर//भित्रको फाईल.txt";
    std::string utf8path = ToUtf8(wfilename );
    std::ifstream iFileStream(utf8path , std::ifstream::in | std::ifstream::binary);
    if(iFileStream.is_open())
    {
        std::cout << "Opened the File\n";
        //Do the work here.
    }
    else
    {
        std::cout << "Cannot open the file\n";

    }
    return 0;

}

When I run this program, the file cannot be opened and execution falls into the else block. Using boost::locale::conv::from_utf(utf8path, "utf_8") instead of utf8path doesn't help either. The code works if I use wifstream with wfilename as its argument, but I don't want to use wifstream. Is there any way to open the file when its name is UTF-8 encoded? I am using Visual Studio 2010.

roalz
Mahadeva
  • None of the underlying Windows APIs use UTF-8. std::ifstream will eventually call CreateFileA or CreateFileW to open the file, and neither of these functions takes UTF-8. – Richard Critten Jun 14 '15 at 12:49
  • So if I am going to use `ifstream`, how should I change the code to make it work? Should I be using `wstring`? – Mahadeva Jun 14 '15 at 12:55
  • The thing is that I am trying to make the code cross-platform. Since Linux is already Unicode-aware, the code should probably work there if I use `ifstream`. How should I tackle this situation? – Mahadeva Jun 14 '15 at 13:04
  • This depends on your standard library implementation. With one that I'm familiar with, it is actually impossible: you can't use iostreams with files that might have non-8-bit filenames. – M.M Jun 14 '15 at 14:00
  • So is my only option to use `ifdefs`, with `wstring` for Windows and `string` for Linux? Is there any other way? – Mahadeva Jun 14 '15 at 15:10

2 Answers


On Windows, filenames MUST be passed either as 8-bit ANSI (in the code page that matches the user's locale) or as UTF-16; there is no other option available. You can keep using std::string and UTF-8 in your main code, but you will have to convert UTF-8 filenames to UTF-16 when you are opening files. That is less efficient, but it is what you need to do.

Fortunately, VC++'s implementations of std::ifstream and std::ofstream provide non-standard overloads of their constructors and open() methods that accept wchar_t* strings for UTF-16 filenames.

explicit basic_ifstream(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::in,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::in,
    int _Prot = (int)ios_base::_Openprot
);
void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode
);
explicit basic_ofstream(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::out,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::out,
    int _Prot = (int)ios_base::_Openprot
);
void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode
);

You will have to use an #ifdef to detect Windows compilation (unfortunately, different C++ compilers identify that differently) and temporarily convert your UTF-8 string to UTF-16 when opening a file.

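#ifdef _MSC_VER
#include <string>
#include <Windows.h>

// Converts a UTF-8 encoded string to UTF-16 using the Win32 API.
std::wstring ToUtf16(const std::string& str)
{
    std::wstring ret;
    const int srcLen = static_cast<int>(str.length());
    // First call: ask how many wchar_t units the result needs.
    int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), srcLen, NULL, 0);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: perform the actual conversion.
        MultiByteToWideChar(CP_UTF8, 0, str.c_str(), srcLen, &ret[0], len);
    }
    return ret;
}
#endif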

int main()
{
    std::string utf8path = ...;
    std::ifstream iFileStream(
        #ifdef _MSC_VER
        ToUtf16(utf8path).c_str()
        #else
        utf8path.c_str()
        #endif
        , std::ifstream::in | std::ifstream::binary);
    ...
    return 0;
}

Note that this is only guaranteed to work in VC++. Other C++ compilers for Windows are not guaranteed to provide similar extensions.
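
To avoid repeating the #ifdef at every place a file is opened, the conversion can be wrapped in a single helper. The following is only a sketch of that idea, not something the VC++ library provides: OpenUtf8 and Utf8ToUtf16 are hypothetical names, and the non-VC++ branch assumes the platform accepts UTF-8 narrow filenames unchanged (typical on Linux, but not on other Windows compilers).

```cpp
#include <fstream>
#include <string>

#ifdef _MSC_VER
#include <Windows.h>

// Hypothetical helper: converts UTF-8 to UTF-16 with the Win32 API.
inline std::wstring Utf8ToUtf16(const std::string& utf8)
{
    std::wstring ret;
    const int srcLen = static_cast<int>(utf8.length());
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), srcLen, NULL, 0);
    if (len > 0)
    {
        ret.resize(len);
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), srcLen, &ret[0], len);
    }
    return ret;
}
#endif

// Hypothetical helper: opens `stream` on a file whose name is UTF-8 encoded.
// The platform-specific branch lives here and nowhere else.
inline void OpenUtf8(std::ifstream& stream, const std::string& utf8path,
                     std::ios_base::openmode mode = std::ios_base::in)
{
#ifdef _MSC_VER
    // VC++ extension: open() accepts a wchar_t* (UTF-16) filename.
    stream.open(Utf8ToUtf16(utf8path).c_str(), mode);
#else
    // Assumption: narrow filenames are passed through unchanged,
    // so a UTF-8 name works as-is.
    stream.open(utf8path.c_str(), mode);
#endif
}
```

Call sites then stay free of preprocessor checks: declare a std::ifstream, call OpenUtf8(stream, utf8path, std::ios_base::in | std::ios_base::binary), and test is_open() as usual.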

UPDATE: as of Windows 10 Insider Preview Build 17035, Microsoft now supports UTF-8 as a system-wide encoding that users can set their locale to. And as of Windows 10 Version 1903 (build 18362), applications can now opt in via their app manifest to use UTF-8 as a process-wide codepage, even if the user locale is not set to UTF-8. These features allow ANSI-based APIs (like CreateFileA(), which std::ifstream/std::ofstream use internally) to work with UTF-8 strings. So, in theory, with this feature turned on, you might be able to pass a UTF-8 encoded string to std::ifstream/std::ofstream and it would "just work". I can't confirm that, as it very much depends on the implementation. It would be safer to stick with passing in UTF-16 filenames, since that is Windows' native encoding, which the ANSI APIs will simply convert to internally.
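
Whether the process is actually running with UTF-8 as its ANSI code page can be checked at run time. The sketch below uses GetACP() for that; NarrowApisUseUtf8 is a hypothetical name, and the non-Windows branch simply assumes that filenames are treated as plain bytes. A true result only says that ANSI-based APIs will interpret narrow strings as UTF-8; it does not prove that a particular std::ifstream implementation honors it.

```cpp
#ifdef _WIN32
#include <Windows.h>
#endif

// Hypothetical helper: reports whether narrow (char) strings handed to
// ANSI-based APIs are interpreted as UTF-8 in this process.
inline bool NarrowApisUseUtf8()
{
#ifdef _WIN32
    // GetACP() returns the active ANSI code page; CP_UTF8 is 65001.
    return GetACP() == CP_UTF8;
#else
    // Assumption for this sketch: non-Windows targets pass filename
    // bytes through unchanged, so UTF-8 names need no conversion.
    return true;
#endif
}
```

If the function returns false, converting the filename to UTF-16 as shown earlier remains the safe route.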

Remy Lebeau
  • +1 this worked. For those who want to convert `utf8` to `utf16`, there is another function too which is available [here](http://stackoverflow.com/a/7154226/2634612). – Mahadeva Jun 14 '15 at 16:25
  • There are many UTF conversion implementations available. Manual implementations (like the one you linked to), Unicode libraries like libiconv and ICU, and even `std::codecvt_utf8_utf16` in C++11. – Remy Lebeau Jun 14 '15 at 16:30
  • Instead of putting `#ifdef` inside every file open, you can create a function `filename(const std::string &fname)` and put all the yucky stuff in one place. Then you just use that function on the filename wherever you need to open a file. – Mark Ransom Aug 17 '16 at 21:11
  • What is "8 bit ANSI"? Do you mean ASCII? That is a 7-bit encoding, often stored in 8-bit bytes with the MSB set to zero. Is that what you mean? – Raedwald Dec 09 '17 at 17:40
  • @Raedwald no, I really meant 8bit ANSI. Unicode strings not encoded in a UTF require an 8-bit encoding, such as Windows-1252, etc (7bit ASCII is a subset of UTF-8). On Windows, user locales are implemented using [code pages](https://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx) that implement these encodings. So, a filename on a Windows system must be encoded in either UTF-16 or the user's default ANSI codepage. – Remy Lebeau Dec 09 '17 at 19:54
  • "The phrase ANSI character set has no well-defined meaning." https://en.wikipedia.org/wiki/ANSI_character_set – Eike Nov 29 '19 at 10:01
  • Thanks for your answer. It really helped! You say "On Windows, you MUST use 8bit ANSI". Why is that? Is there no plan for Windows to support UTF8? Is the problem due to "Windows" or to "Visual Studio"? That is, does mingw targeting Windows also have to use this `ToUtf16` implementation? – jpo38 Oct 28 '20 at 09:42
  • @jpo38 "*There is no plan for Windows to support UTF8?*" - at the time that I wrote this answer, there wasn't yet, no. Microsoft recently added a new *experimental* feature in Windows 10 to add support for UTF8 in ANSI-based APIs. But apps have to opt in to the new feature, and I don't know if that feature extends into implementations of `std::ifstream` or not. So best to stick with UTF16 until someone tests and confirms it works with UTF8. "*Is the problem due to "Windows" or to "Visual Studio"?*" - Windows itself, at the level where a path string is given to the file system API. – Remy Lebeau Oct 28 '20 at 14:41
  • Ok, thank you for your reply. It's not that bad to do the utf16/utf8 conversion when needed. – jpo38 Oct 28 '20 at 20:10
  • Thanks! It's good to see Microsoft tries to support that standard in the end! – jpo38 Oct 28 '20 at 21:13
  • Why can't `ToUtf16` use standard library functions and simply be `std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter; ret = converter.from_bytes(str);`? Do we really need to use the Windows API here? – jpo38 Nov 02 '20 at 08:11
  • @jpo38 you can use whatever you want to implement `ToUtf16()`. Plenty of Unicode APIs to choose from. `wstring_convert()` will work, but note that it has been deprecated in C++17, with no standard replacement defined yet. – Remy Lebeau Nov 02 '20 at 08:20

You can use std::filesystem::u8path in C++17:

std::filesystem::path pa = std::filesystem::u8path(yourStdStringPath);
std::ofstream ofs(pa);

u8path is deprecated in C++20, where std::filesystem::path can be constructed directly from a char8_t string such as a u8-prefixed literal.
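
A sketch that compiles under both standards might look like the following. PathFromUtf8 and WriteTextUtf8 are hypothetical helper names, and the feature-test macro __cpp_lib_char8_t is used here as an assumption about how to tell the two cases apart.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: builds a path from a UTF-8 encoded std::string.
inline std::filesystem::path PathFromUtf8(const std::string& utf8)
{
#if defined(__cpp_lib_char8_t)
    // C++20: path treats char8_t input as UTF-8, so u8path is not needed.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path interprets the char sequence as UTF-8.
    return std::filesystem::u8path(utf8);
#endif
}

// Hypothetical helper: writes `text` to the file named by a UTF-8 path.
inline bool WriteTextUtf8(const std::string& utf8path, const std::string& text)
{
    // Since C++17 the file streams accept a std::filesystem::path directly.
    std::ofstream ofs(PathFromUtf8(utf8path), std::ios::binary);
    if (!ofs.is_open())
        return false;
    ofs << text;
    return ofs.good();
}
```

Because the stream receives a std::filesystem::path, the library performs the conversion to the platform's native filename encoding, and no #ifdef is needed at the call site.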

Kaaf