
I have some text files that are encoded in UTF-8. Is there a way to read them using the C++ stream classes (wifstream, for example)?

I have seen some external references such as Boost and some CodeProject code snippets, but I don't want to pull those in just for this purpose.

On Linux it somehow works by calling imbue(std::locale("en_US")), but not on Windows. I think the problem is that Windows assumes a wifstream is a UTF-16 encoded stream. Can't I specify the Unicode encoding for the wifstream class somehow, so that it uses UTF-8 rather than UTF-16?

Aarkan
  • possible duplicate of [does (w)ifstream support different encodings](http://stackoverflow.com/questions/1274910/does-wifstream-support-different-encodings) – Flexo Mar 30 '12 at 16:17
  • Also related: http://stackoverflow.com/questions/7889032/utf-8-compliant-iostreams – Flexo Mar 30 '12 at 16:22
  • What do you use them for? You can always read UTF-8 files using `ifstream` if you subsequently treat the resulting buffers as UTF-8. The "wide" streams tend to be even less portable since `wchar_t` has different sizes in Windows and Linux. – Philipp Apr 01 '12 at 21:25
  • I am not going to read it as a whole buffer. Suppose I use extraction operator and read the stream character by character, what is going to happen? Will it read the characters correctly? I guess not. – Aarkan Apr 03 '12 at 04:12

2 Answers


In addition to just reading the bytes from the file normally, and treating them as UTF-8 (e.g., by not passing them to anything that expects locale encoded strings, only to things that expect UTF-8), Windows has another way to read in UTF-8.
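
To illustrate the first option (read the bytes with a plain `ifstream` and treat them as UTF-8 yourself), here is a minimal sketch. The helper names `decode_utf8` and `read_utf8_file` are hypothetical, not part of any library, and the decoder only checks lead and continuation bytes; it does not reject overlong encodings or surrogate code points, so treat it as a starting point rather than a validating parser.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Assumption: only basic structural checks are done (no overlong/surrogate checks).
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)              { cp = lead;        len = 1; }  // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: read a whole file as raw bytes, then decode it.
std::u32string read_utf8_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    if (!in) throw std::runtime_error("cannot open " + path);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return decode_utf8(bytes);
}
```

Because the stream itself only ever sees `char`, this behaves the same on Windows and Linux; extracting one `char` at a time with `>>` would give you individual UTF-8 bytes, not characters, which is why the decoding step is separate.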

You can set a 'UTF-8' mode on file descriptors, and then use wide character input and output on that file descriptor and Microsoft's C runtime will handle transforming the wide characters to and from UTF-8 encoded byte streams:

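#include <fcntl.h>  /* _O_U8TEXT */
#include <io.h>     /* _setmode (Microsoft C runtime only) */
#include <stdio.h>  /* _fileno, wprintf */

int main(void) {
  /* Switch stdout to UTF-8 text mode; wide output is then encoded as UTF-8. */
  _setmode(_fileno(stdout), _O_U8TEXT);
  wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
  return 0;
}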

If you run the above program with output redirected to a file you will get a UTF-8 encoded file.

Setting one of these Unicode modes on a file descriptor has the additional effect on consoles that wide character output will actually work on the console. I'm not sure why exactly Microsoft chose "broken" as the default, but at least there's a way to enable a "not broken" mode.

bames53
  • What about C++? I want to use the wifstream class for my file I/O. – Aarkan Mar 31 '12 at 14:07
  • @Aarkan If one of the Unicode modes is enabled, then you can use any of the standard C and C++ wchar_t I/O routines on it, including wifstream methods. The trick there will be getting the file descriptor from the stream in order to set the mode in the first place. – bames53 Mar 31 '12 at 23:13
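
If getting at the file descriptor is awkward, a different technique (not the `_setmode` approach above) is to imbue the `wifstream` with a UTF-8 conversion facet from the standard library. The sketch below assumes C++11 `<codecvt>`, which is deprecated since C++17, and the function name `read_utf8_wide` is hypothetical. Note that where `wchar_t` is 16 bits (Windows), `std::codecvt_utf8<wchar_t>` only covers the Basic Multilingual Plane; `std::codecvt_utf8_utf16<wchar_t>` may be the better fit there.

```cpp
#include <codecvt>    // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a UTF-8 file character by character as wchar_t.
std::wstring read_utf8_wide(const std::string& path) {
    std::wifstream in;
    // Install the facet before opening so every read is converted from UTF-8.
    // The locale takes ownership of the facet pointer.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::wstring text;
    wchar_t ch;
    while (in.get(ch))  // each get() yields one decoded character, not one byte
        text.push_back(ch);
    return text;
}
```

With the facet in place, `in >> ch` and `std::getline(in, line)` also yield decoded characters rather than raw bytes.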

You can read UTF-8 files on Windows perfectly normally; the only problem is when you want to do something with them.

Almost all Windows API calls use UTF-16 or MBCS, so you will need to convert from UTF-8 to UTF-16 or MBCS whenever you pass a string to a Windows API. See "Converting C-Strings from Local Encoding to UTF8".
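
As a rough illustration of that conversion step, here is a sketch using only the standard library. It assumes C++11 `std::wstring_convert` and `std::codecvt_utf8_utf16`, both deprecated since C++17, and the function names are hypothetical. On Windows itself, `MultiByteToWideChar` with `CP_UTF8` is the native way to do the same thing.

```cpp
#include <codecvt>   // std::codecvt_utf8_utf16 (C++11, deprecated since C++17)
#include <locale>    // std::wstring_convert
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// from_bytes throws std::range_error on malformed input.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Characters outside the Basic Multilingual Plane come out as a surrogate pair, which is the form the wide (`W`) Windows APIs expect.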

Martin Beckett