I can use ofstream to write a file that starts with a UTF-8 BOM. I can also write a Unicode string to a file using wofstream imbued with a UTF-8 locale (codecvt_utf8). However, I cannot figure out how to write a Unicode string to a file that is UTF-8 encoded with a BOM.

Ajay
Alex Huynh

2 Answers

A BOM is just a few optional bytes at the beginning of a file that identify its encoding. It has nothing to do with std::fstream directly, because fstream is just a file stream for reading and writing arbitrary bytes/characters.

You just need to write the BOM manually before you write your UTF-8 encoded string.

const unsigned char utf8BOM[] = { 0xEF, 0xBB, 0xBF };
// write() expects a const char*, so cast the byte array
fileStream.write(reinterpret_cast<const char*>(utf8BOM), sizeof(utf8BOM));
// write the rest of the UTF-8 encoded string...
David Haim
  • Or if you're using a wide stream with the locale doing the UTF-8 encoding, then it's just the character `U+FEFF` – Steve Jessop Jun 02 '16 at 12:11
  • @SteveJessop UTF-16 Big Endian: `FE FF`, Little Endian: `FF FE` –  Jun 02 '16 at 12:19
  • @Dieter that's the byte sequence. The Unicode code point is (regardless of endianness) `U+FEFF` – rubenvb Jun 02 '16 at 12:22
  • fstream can write the BOM to a file but cannot write a Unicode string (e.g. "日本医療政策機構" or "Phở"), as I mentioned in my question. – Alex Huynh Jun 03 '16 at 01:35
  • FYI: you can also get the UTF-8 BOM with C++11 compilers by using `const char utf8Bom[] = u8"\uFEFF"` – Nicol Bolas Jun 03 '16 at 04:52
  • To address @AlexHuynh's point, as I had the same problem: following SteveJessop and rubenvb, when opening a std::wofstream ofs, I achieved success with `ofs << L"\uFEFF";`. – David Carr Jan 08 '23 at 01:29
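
The wide-stream route mentioned in the comments can be sketched roughly as follows. `write_wide_utf8_bom` is a hypothetical helper name, and the sketch assumes every character fits in a single wchar_t (on Windows, where wchar_t is 16 bits, that limits it to the Basic Multilingual Plane). Note that `<codecvt>` is deprecated since C++17, so treat this as an illustration rather than a recommendation.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-8 with a leading BOM.
// Assumption: each character fits in one wchar_t (no surrogate pairs).
bool write_wide_utf8_bom(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // The locale takes ownership of the facet; imbue before opening the file.
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::binary);
    if (!out) return false;

    out << L'\uFEFF';  // the facet encodes U+FEFF as the bytes EF BB BF
    out << text;       // the rest of the text is encoded as UTF-8 too
    out.flush();
    return out.good();
}
```

Because the locale performs the encoding, the BOM is written as the character `U+FEFF` rather than as three raw bytes.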

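The example below works in Visual Studio 2015 and recent GCC versions. Note that from C++20 on, u8 literals have type char8_t and no longer initialize a std::string directly, so compile it as C++11 to C++17: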

#include <fstream>
#include <string>

int main()
{
    // u8 literal: the string holds UTF-8 encoded bytes
    std::string utf8 = u8"日本医療政策機構\nPhở\n";
    std::ofstream f("c:\\test\\ut8.txt");

    // write the 3-byte UTF-8 BOM first
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    f.write(reinterpret_cast<const char*>(bom), sizeof(bom));

    // then the UTF-8 encoded text
    f << utf8;
    return 0;
}

In older versions of Visual Studio, which lack the u8 prefix, you have to declare a UTF-16 string (with the L prefix) and then convert it from UTF-16 to UTF-8:

#include <iostream>
#include <string>
#include <fstream>
#include <Windows.h>

std::string get_utf8(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    int sz = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), 0, 0, 0, 0);
    std::string res(sz, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &res[0], sz, 0, 0);
    return res;
}

std::wstring get_utf16(const std::string &str)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

int main()
{
    std::string utf8 = get_utf8(L"日本医療政策機構\nPhở\n");

    std::ofstream f("c:\\test\\ut8.txt");

    unsigned char bom[] = { 0xEF,0xBB,0xBF };
    f.write((char*)bom, sizeof(bom));

    f << utf8;
    return 0;
}
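
If depending on Windows.h is undesirable, the conversion step could instead be sketched with the standard library. `to_utf8` is a hypothetical stand-in for `get_utf8` above; it assumes `<codecvt>` is available in the toolchain (it is deprecated since C++17) and that no characters above U+FFFF are needed on platforms where wchar_t is 16 bits.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical stand-in for get_utf8, using only the standard library.
// Assumption: no code points above U+FFFF when wchar_t is 16 bits wide.
std::string to_utf8(const std::wstring& wstr)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(wstr);  // throws std::range_error on bad input
}
```

The result can be written after the BOM exactly as in the examples above.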
Barmak Shemirani
  • Thanks Barmak. I am using Visual Studio 2013 and get an error on the `u8` literal because VS2013 does not support it. I know it works in VS2015, but I want to do it in VS2013. – Alex Huynh Jun 03 '16 at 04:34
  • I don't remember VS2013 capabilities. See the updated code, it should work for older compilers. – Barmak Shemirani Jun 03 '16 at 04:35