7

I want to write a std::wstring to a file and read that content back as a std::wstring. This works as expected when the string is L"<any English letters>". The problem appears with characters from Bengali, Kannada, Japanese, or any other non-English script. I have tried several options:

  1. Converting the std::wstring to std::string and writing that to the file, then reading it back as std::string and converting it to std::wstring
    • Writing works (I can see the text in an editor), but reading gives the wrong characters
  2. Writing the std::wstring through a std::wofstream; this does not help either for native-language characters such as std::wstring data = L"হ্যালো ওয়ার্ল্ড";

The platforms are Mac and Linux; the language is C++.

Code:

bool
write_file(
    const char*         path,
    const std::wstring  data
) {
    bool status = false;
    try {
        std::wofstream file(path, std::ios::out|std::ios::trunc|std::ios::binary);
        if (file.is_open()) {
            //std::string data_str = convert_wstring_to_string(data);
            file.write(data.c_str(), (std::streamsize)data.size());
            file.close();
            status = true;
        }
    } catch (...) {
        std::cout<<"exception !"<<std::endl;
    }
    return status;
}


// Read Method

std::wstring
read_file(
    const char*  filename
) {
    std::wifstream fhandle(filename, std::ios::in | std::ios::binary);
    if (fhandle) {
        std::wstring contents;
        fhandle.seekg(0, std::ios::end);
        contents.resize((int)fhandle.tellg());
        fhandle.seekg(0, std::ios::beg);
        fhandle.read(&contents[0], contents.size());
        fhandle.close();
        return(contents);
    }
    else {
        return L"";
    }
}

// Main

int main()
{
  const char* file_path_1 = "./file_content_1.txt";
  const char* file_path_2 = "./file_content_2.txt";

  //std::wstring data = L"Text message to write onto the file\n";  // This works as expected
  std::wstring data = L"হ্যালো ওয়ার্ল্ড";  // This does not work as expected

  // Write some data
  write_file(file_path_1, data);
  // Read the file back
  std::wstring out = read_file(file_path_1);

  std::wcout<<L"File Content: "<<out<<std::endl;
  // Write the same data to a different file
  write_file(file_path_2, out);
  return 0;
}
ST3
  • 2
    Use `std::wifstream` and `std::wofstream` (or `std::wfstream`), then you can use `std::wstring` directly. – Some programmer dude Aug 02 '13 at 08:24
  • @JoachimPileborg, I wrote the sample code above, but it does not work as expected when the string contains any non-English character, such as std::wstring data = L"হ্যালো ওয়ার্ল্ড"; – Abhrajyoti Kirtania Aug 02 '13 at 08:28
  • 1
    Unrelated, but why do you open the file in binary mode if you're only reading/writing text? Also, when writing you don't have to `flush` the file as that will be done by closing it. – Some programmer dude Aug 02 '13 at 08:35
  • @JoachimPileborg He does (but that may be the result of an edit after your comments). But in most implementations (and what I would expect) is that the locales `"C"` (the default) and `"Posix"` will only map codes corresponding to ASCII characters. – James Kanze Aug 02 '13 at 08:35

5 Answers

3

How a wchar_t is output depends on the locale. The default locale ("C") generally doesn't accept anything but ASCII (Unicode code points 0x20...0x7E, plus a few control characters.)

Any time a program handles text, the very first statement in main should be:

std::locale::global( std::locale( "" ) );

If the program uses any of the standard stream objects, the code should also imbue them with the global locale, before any input or output.
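
As a minimal sketch of that advice, the helper below installs the environment locale globally and imbues the standard wide streams with it. The name `use_environment_locale` is hypothetical, not part of any library. The `try`/`catch` is an addition here: `std::locale("")` throws `std::runtime_error` when the runtime does not accept the locale named by `$LANG` and the `$LC_...` variables, which may be what produces the abort reported in the comments below.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: tries to install the user's environment locale
// (taken from $LANG / $LC_* on Unix) as the global locale and to imbue the
// standard wide streams with it.
// Returns false if the runtime rejects the locale name; in that case the
// previous global locale (by default "C") stays in effect.
bool use_environment_locale() {
    try {
        std::locale user_locale("");       // may throw std::runtime_error
        std::locale::global(user_locale);  // streams created later inherit it
        std::wcout.imbue(user_locale);     // already-existing streams need imbue
        std::wcin.imbue(user_locale);
        std::wcerr.imbue(user_locale);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling `use_environment_locale()` as the first statement of `main` matches the recommendation above; a `false` result tells the program it is still running with the previous locale, so wide output beyond ASCII is likely to fail.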

James Kanze
  • When I added std::locale::global( std::locale( "" ) ); in main, the program aborted with: libc++abi.dylib: terminate called throwing an exception, followed by Abort trap: 6 – Abhrajyoti Kirtania Aug 02 '13 at 08:48
  • @AbhrajyotiKirtania That's strange, because most of my C++ programs start this way (and it is required to "work" by the C++ standard, although the implementation gets to define what it means by "work"). What's your environment? (And if it's Unix based, what are `$LANG` and the `$LC_...` set to?) – James Kanze Aug 02 '13 at 08:55
  • I am trying this on a Unix-based system. How does the $LANG setting make a difference? – Abhrajyoti Kirtania Aug 02 '13 at 09:05
  • `$LANG` determines the locale used by `std::locale( "" )`. Under Unix, passing an empty string as the name of the locale causes (or should cause) the implementation to construct a locale based on `$LANG` and the `$LC_...` environment variables. `std::locale::global( std::locale( "" ) );` is the C++ equivalent of `setlocale( LC_ALL, "" )`, as defined by Posix (in http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html). If it doesn't work, and your `$LANG` and `$LC_...` are set reasonably, then this is a serious bug in the g++ libraries. – James Kanze Aug 02 '13 at 09:27
  • I would really recommend against doing internationalization by relying on the system's locale. – bames53 Aug 02 '13 at 17:18
  • @JamesKanze "That's strange, because most of my C++ programs start this way" libstdc++ on OS X doesn't implement proper locale support so it won't work except with the "C" locale. It will consider the normal system locale names to be invalid. libc++ has proper locale support though. – bames53 Aug 02 '13 at 17:20
  • I should clarify; using the system locale may be fine for things like getting default punctuation, formats, etc., but encodings should never depend on locales. – bames53 Aug 02 '13 at 17:48
0

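To read and write Unicode files (assuming you want to write Unicode characters), you can try fopen_s. The ccs=UNICODE mode flag is a Microsoft C runtime extension, so this applies to Windows.

FILE *file;

// fopen_s returns 0 on success
if (fopen_s(&file, file_path, "w,ccs=UNICODE") == 0)
{
    fputws(your_wstring().c_str(), file);
    fclose(file);
}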
mag_zbc
0

Later edit: this is for Windows (since no tag was present at the time of the answer)

You need to imbue the stream with a locale that supports those characters. Try something like this (for UTF-8/UTF-16):

std::wofstream myFile("out.txt"); // writing to this file 
myFile.imbue(std::locale(myFile.getloc(), new std::codecvt_utf8_utf16<wchar_t>));

And when you read from that file you have to do the same thing:

std::wifstream myFile2("out.txt"); // reading from this file
myFile2.imbue(std::locale(myFile2.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
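
For Linux or Mac, where `wchar_t` is 32 bits wide, the same idea can be sketched with `std::codecvt_utf8<wchar_t>`, the facet the comments under this answer settle on. The helper names `write_utf8` and `read_utf8` are hypothetical. This is only a sketch: the `<codecvt>` facets are deprecated since C++17, and the stream is imbued before the file is opened so that no I/O has happened when the locale changes.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` encoded as UTF-8.
bool write_utf8(const char* path, const std::wstring& text) {
    std::wofstream out;
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::out | std::ios::trunc | std::ios::binary);
    out << text;
    out.flush();
    return static_cast<bool>(out);  // false if the open or the write failed
}

// Hypothetical helper: reads the whole UTF-8 file at `path` into a wstring.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8(const char* path) {
    std::wifstream in;
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::in | std::ios::binary);
    // Reading through the stream buffer avoids sizing the string from a
    // byte count, which is larger than the character count for UTF-8.
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

With these two helpers, writing a string such as L"হ্যালো ওয়ার্ল্ড" and reading it back should give an equal `std::wstring`, independent of `$LANG`.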
Iosif Murariu
  • If he wants UTF-8. If he's on Windows, he probably wants UTF-16LE. And in any environment, he wants the user to decide. (But this gets tricky when reading, since files from different sources may be encoded differently.) – James Kanze Aug 02 '13 at 08:37
  • Also, `std::wofstream` will start with the default global locale. If he's set this correctly at the start of `main`, he doesn't have to `imbue` anything. – James Kanze Aug 02 '13 at 08:38
  • Yes, of course. I assumed that his characters are UTF-8, so that's why I used UTF-8. I'll edit my answer :) – Iosif Murariu Aug 02 '13 at 08:39
  • @JamesKanze, yes I know that too (about `wofstream`), but if he's got a similar setup like mine (Windows w/ English although I'm not English) he may have to do this. But again, you're right. – Iosif Murariu Aug 02 '13 at 08:44
  • Aha. I'm not really familiar with how Windows handles locales (I've only worked with Windows in an English speaking environment); I would _expect_ that even under English Windows, there would be some way of setting the locale via environment variables, which should be picked up with `locale( "" )`. But given the way most Windows users work, I rather doubt that they'd be using it. (For those unfamiliar with Unix: there are no national versions for Unix. Instead, each user sets environment variables stating what language, etc. he wants to use.) – James Kanze Aug 02 '13 at 08:51
  • @JamesKanze and all, I am looking for something that works on Mac/Linux – Abhrajyoti Kirtania Aug 02 '13 at 09:02
  • well... there go all my ideas :) – Iosif Murariu Aug 02 '13 at 09:15
  • @IosifM. Your solution should also work under Linux or Mac. Except that since `wchar_t` is UTF-32 on these platforms, the codecvt you need should be `std::codecvt_utf8_utf32`. _And_... the standard Unicode file on these systems is UTF-8, so that's almost certainly what he wants. (Of course, these codecvt are only guaranteed to be present in C++11. And I have no idea whether current versions of g++ support this part of C++11---historically, g++ has been very behind VC++ in things concerning i18n.) – James Kanze Aug 02 '13 at 09:21
  • @JamesKanze the only problem is that `std::codecvt_utf8_utf32` doesn't exist afaik. **However** he should try this code on his machine and see if it works. If it does - yay, if it doesn't - boo – Iosif Murariu Aug 02 '13 at 09:27
  • @IosifM. You're right about `std::codecvt_utf8_utf32`. It should be just `std::codecvt_utf8`. (In your case as well, I think. But on a system where `wchar_t` is UTF-16, I think both will be the same when instantiated over `wchar_t`.) – James Kanze Aug 02 '13 at 09:39
  • @JamesKanze `codecvt_utf8` with 16-bit `wchar_t` will only support characters in the BMP, not the full Unicode range. To support UTF-16 `wchar_t` you must use `codecvt_utf8_utf16`. – bames53 Aug 02 '13 at 17:52
  • @bames53 That's not what the standard says. (But as you point out, compliance with this particular section of the standard has been a weak point of many compilers.) – James Kanze Aug 02 '13 at 18:16
  • @JamesKanze It does say that; 22.5/4 "For the facet codecvt_utf8 The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem)". UCS2 uses 16-bit code units and does not support characters outside the BMP. – bames53 Aug 02 '13 at 18:23
0

One possible problem may be when you read the string back, because you set the length of the string to the number of bytes in the file and not the number of characters. This means that you attempt to read past the end of the file, and also that the string will contain trash at the end.

If you're dealing with text files, why not simply use the normal output and input operators << and >> or other textual functions like std::getline?

Some programmer dude
  • I'd think just the reverse. The number of bytes will never be less than the number of characters, but it could be significantly more. – James Kanze Aug 02 '13 at 08:39
  • Re your edit: and input and output into `std::wstring`, so he doesn't have to worry about the size anywhere. – James Kanze Aug 02 '13 at 08:40
  • 1
    And with regards to your initial comment: while I doubt that this is the problem here, it _is_ a very valid point. When reading, you must _always_ verify that the read succeeded before using the data. And `std::wistream::read` is a bit special, since it will set the `failbit` even when it succeeds in reading some (but not all) characters; the failure condition is `!stream && stream.gcount() == 0`. If `stream.gcount() != 0`, you've successfully read that many characters. – James Kanze Aug 02 '13 at 08:46
0

Do not use wstring or wchar_t. On non-Windows platforms wchar_t is pretty much worthless these days.

Instead you should use UTF-8.

#include <fstream>
#include <iostream>
#include <string>

bool
write_file(
    const char*         path,
    const std::string   data
) {
    try {
        std::ofstream file(path, std::ios::out | std::ios::trunc | std::ios::binary);
        file.exceptions(std::ios::failbit | std::ios::badbit);  // throw if the open or a write fails
        file << data;
        return true;
    } catch (...) {
        std::cout << "exception!\n";
        return false;
    }
}


// Read Method

std::string
read_file(
    const char*  filename
) {
    std::ifstream fhandle(filename, std::ios::in | std::ios::binary);

    if (fhandle) {
        std::string contents;
        fhandle.seekg(0, std::ios::end);
        contents.resize(fhandle.tellg());
        fhandle.seekg(0, std::ios::beg);
        fhandle.read(&contents[0], contents.size());
        return contents;
    } else {
        return "";
    }
}

int main()
{
  const char* file_path_1 = "./file_content_1.txt";
  const char* file_path_2 = "./file_content_2.txt";

  std::string data = "হ্যালো ওয়ার্ল্ড"; // linux and os x compilers use UTF-8 as the default execution encoding.

  write_file(file_path_1, data);
  std::string out = read_file(file_path_1);

  std::cout << "File Content: " << out << '\n';
  write_file(file_path_2, out);
}
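
If other parts of a program still hold `std::wstring`, one possible bridge to the UTF-8 `std::string` used above is `std::wstring_convert`. This is a sketch under assumptions: the helper names `to_utf8` and `from_utf8` are hypothetical, `wchar_t` is assumed to be 32 bits wide as on Linux and OS X, and `std::wstring_convert` is deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 bytes.
// Throws std::range_error if a character cannot be converted.
std::string to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(wide);
}

// Hypothetical helper: decodes UTF-8 bytes into a wide string.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring from_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}
```

The conversion does not depend on the global locale, which keeps the file encoding independent of `$LANG`.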
Community
bames53