    std::string path("path.txt");
    std::fstream f(path);
    f.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::string lcpath;
    f >> lcpath;

Reading UTF-8 text from path.txt fails on Windows when the program is compiled with MSVC, in the sense that the contents of lcpath are not interpreted as UTF-8.

The code below works correctly on Linux when compiled with g++.

    std::string path("path.txt");
    std::fstream ff;
    ff.open(path.c_str());
    std::string lcpath;
    ff>>lcpath;

Does fstream on Windows (MSVC) assume ASCII only by default?

In the first snippet, if I replace string with wstring and fstream with wfstream, lcpath gets the correct value on Windows as well.

EDIT: If I convert the value read into lcpath using MultiByteToWideChar(), I get the correct representation. But why can't I read a UTF-8 string directly into std::string on Windows?
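
The conversion mentioned in the EDIT can be sketched roughly as below. This is only an illustration: `utf8_to_wide` is a hypothetical helper name, the Windows branch assumes the usual two-call `MultiByteToWideChar` pattern with `CP_UTF8`, and the non-Windows branch is an assumed fallback built on the `<codecvt>` facilities, which are deprecated since C++17.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <codecvt>
#include <locale>
#endif

// Hypothetical helper: converts UTF-8 bytes held in a std::string
// into a std::wstring.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
#ifdef _WIN32
    // First call: ask how many wchar_t units the result needs.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
#else
    // Assumed fallback for non-Windows builds (deprecated since C++17).
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
#endif
}
```

A call such as `utf8_to_wide(lcpath)` would then give a wide string for APIs that expect one. The `std::string` itself only stores bytes, so a step like this matters once those bytes are passed to something that expects a different encoding.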

user3819404
  • It's not clear what you're asking. Is there some operation you're attempting on `lcpath` that isn't working? – Mark Ransom Dec 05 '19 at 19:00
  • @MarkRansom It does not represent the correct text like if I enter something like `°§èé€`, `lcpath` reflects some other gibberish which looks like it is treating the read characters as ASCII. – user3819404 Dec 05 '19 at 19:06
  • Probably not ASCII then, but another 8-bit code page. The problem is that whatever you're using to display `lcpath` isn't treating it as UTF-8. Since UTF-8 is very rare on Windows there's very little built-in support for it. – Mark Ransom Dec 05 '19 at 19:21
  • To be precise, `°§èé€.txt` is seen as `Â°Â§Ã¨Ã©â‚¬.txt`. That looks like ANSI, since in Notepad++ if I enter the UTF-8 text and then change the encoding to ANSI, I get the latter string. – user3819404 Dec 05 '19 at 19:24
  • Actually Ansi is a bit unspecific. I'm guessing it's actually cp1252, which is the default for a US copy of Windows. And again this is just the way it's being displayed, the byte values can be proper UTF-8 and you'd never know it. – Mark Ransom Dec 05 '19 at 19:33
  • Read https://stackoverflow.com/q/3298569/5987 and see if it sheds any light. – Mark Ransom Dec 05 '19 at 19:35
  • imbue() will silently fail if you have opened the file. You must imbue it before opening. – Martin York Dec 05 '19 at 19:38
  • Check if you set the [`_MBCS` macro](https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=vs-2019). – João Paulo Dec 05 '19 at 19:55
  • @JoãoPaulo yes it is set. Moreover I compiled the same program using g++ on windows and it also is unable to understand utf-8. – user3819404 Dec 06 '19 at 04:42
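
One way to check the point raised in the comments, that `lcpath` may already hold valid UTF-8 bytes and only the display is wrong, is to dump the bytes as hex instead of printing the string. The following is a minimal sketch under that assumption; `read_token` and `to_hex` are hypothetical helper names, not part of any library.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: reads the first whitespace-delimited token from a
// file through a plain narrow stream, with no locale imbued.
std::string read_token(const std::string& file) {
    std::ifstream in(file, std::ios::binary);
    std::string token;
    in >> token;
    return token;
}

// Hypothetical helper: renders each byte as two uppercase hex digits
// followed by a space.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char c : bytes) {
        std::snprintf(buf, sizeof buf, "%02X ", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}
```

For a file that contains `°§èé€.txt` saved as UTF-8 without a BOM, `to_hex(read_token("path.txt"))` should start with `C2 B0 C2 A7 C3 A8 C3 A9 E2 82 AC`. If it does, the stream read the text correctly and the remaining problem lies in how the string is displayed or passed on.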

1 Answer

Imbuing an opened file can be problematic:

http://www.cplusplus.com/reference/fstream/filebuf/imbue/

If loc is not the same locale as currently used by the file stream buffer, either the internal position pointer points to the beginning of the file, or its encoding is not state-dependent. Otherwise, it causes undefined behavior.

The problem here is that when a file is opened and it starts with a BOM, the BOM will usually be read from the file by the currently installed locale. Thus the position pointer is no longer at the beginning of the file and we have undefined behavior.

To make sure your locale is set correctly, you must imbue it before opening the file.

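    // Imbue the locale first, while the stream is not yet associated with a file.
    std::fstream f;
    f.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));

    // Only then open the file.
    std::string path("path.txt");
    f.open(path);

    std::string lcpath;
    f >> lcpath;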
Martin York
  • I tried opening file after `imbue`, but still the same issue. It is unable to reflect it in utf-8. If everything is wstring and wfstream, then it is able to represent it correctly. – user3819404 Dec 06 '19 at 03:06
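
The wide-stream variant that the comment above reports as working could look roughly like the sketch below. It is an illustration under stated assumptions, not the poster's actual code: `read_wide_token` is a hypothetical helper, `std::locale::classic()` stands in for the MSVC-specific `std::locale::empty()`, and `std::codecvt_utf8` comes from `<codecvt>`, which is deprecated since C++17.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads the first whitespace-delimited token of a
// UTF-8 encoded file as wide characters.
std::wstring read_wide_token(const std::string& file) {
    std::wifstream in;
    // Imbue before open(), so the conversion facet is in place before
    // any bytes are read from the file.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(file, std::ios::binary);
    std::wstring token;
    in >> token;  // bytes are decoded from UTF-8 into wchar_t values
    return token;
}
```

The facet converts between the external byte sequence and `wchar_t`, so it has something to do on a `wchar_t` stream. That may be why the same `imbue` call appears to change nothing when the stream and string are narrow.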