A requirement for my software is that the encoding of a file which contains exported data shall be UTF8. But when I write the data to the file the encoding is always ANSI. (I use Notepad++ to check this.)

What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.

line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*

// ...

if( inputFile.is_open() )
{
    while( inputFile.good() )
    {
        getline(inputFile,line);

        //1
        DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
        wchar_t *pwcharText;
        pwcharText = new wchar_t[ dwCount];

        //2
        MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );

        //3
        dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
        char *pText;
        pText = new char[ dwCount ];

        //4
        WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );

        fprintf(pOutputFile,pText);
        fprintf(pOutputFile,"\n");

        delete[] pwcharText;
        delete[] pText;
    }
}

// ...

Unfortunately the encoding is still ANSI. I searched a while for a solution but I always encounter the solution via MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?

I also searched here on SO for a solution, but most UTF-8 questions deal with C# and PHP.

moshbear
Exa
  • If you only write English characters to the file, Notepad++ is correct in displaying ANSI, and the file is also valid UTF-8, because ASCII characters are encoded identically in ANSI and in UTF-8. – RedX Jul 25 '12 at 09:15
  • The file would be a CSV file containing English letters, numbers and some special characters ('/', ';', ':', ',', '.', '(', ')'). – Exa Jul 25 '12 at 09:19
  • Does your compiler have support for [std::codecvt_utf8](http://en.cppreference.com/w/cpp/locale/codecvt_utf8)? – Jesse Good Jul 25 '12 at 09:21
  • Yes, I think so, I'm using VS2010. – Exa Jul 25 '12 at 09:25
  • The "u8" prefix is not recognized. – Exa Jul 25 '12 at 09:27
  • If you won't have any letters or symbols other than those, then don't worry. That is all ASCII and so automatically UTF-8. – RedX Jul 25 '12 at 09:32
  • If it's all pure ASCII (and therefore, automatically UTF-8 as well), you may want to write the UTF-8 Byte Order Mark (AKA BOM) into the file as the very first thing. – Alexey Frunze Jul 25 '12 at 09:38
  • I agree with Alexey. Do your requirements allow a BOM, or is that forbidden? Secondly, you need to test it by outputting something other than English characters. Try 金 (kanji for gold) or золото (Russian for gold) and see what Notepad++ says then. – Ben Jul 25 '12 at 09:56
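
To illustrate the point made in the comments: a file that contains only 7-bit ASCII bytes is byte-for-byte identical whether you call it ASCII, ANSI or UTF-8, so an editor has nothing to distinguish them by. A minimal, portable sketch of such a check follows; the function name `isPureAscii` is made up for this example.

```cpp
#include <string>

// Hypothetical helper: returns true when every byte is in the 7-bit ASCII
// range (0x00-0x7F). Such data is encoded identically in ASCII, in the
// Windows ANSI code pages and in UTF-8.
bool isPureAscii(const std::string& bytes)
{
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c > 0x7F)
            return false;   // a non-ASCII byte: the encoding now matters
    }
    return true;
}
```

If this returns true for the whole exported CSV, the file already satisfies a "must be UTF-8" requirement, even though Notepad++ labels it ANSI.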

4 Answers

On Windows with VC++ 2010, this is possible using the C++11 localization facet std::codecvt_utf8_utf16 (not yet implemented in GCC, as far as I know). The sample code on cppreference.com has all the basic information you need to read and write a UTF-8 file.

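#include <codecvt>   // std::codecvt_utf8_utf16
#include <fstream>
#include <locale>
std::wstring wFromFile = L"teststring";  // L"" literal: _T("") is narrow in non-Unicode builds
std::wofstream fileOut("textOut.txt");
// Swap in a facet that writes the UTF-16 wchar_t data as UTF-8 bytes.
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut << wFromFile;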

With this, the file is written as UTF-8 instead of ANSI (checked in Notepad). I hope this is what you need.

SChepurin

On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
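
As a rough sketch of what putting a byte-order mark at the front can look like, the following portable C++ writes the three UTF-8 BOM bytes (EF BB BF) before the payload. The function name `writeUtf8WithBom` is hypothetical, and the sketch assumes the text passed in is already UTF-8 (pure ASCII qualifies). Whether a BOM is acceptable depends on the programs that will read the file.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `utf8Text` to `path`, preceded by the
// three-byte UTF-8 BOM. Returns false if the file could not be opened
// or written.
bool writeUtf8WithBom(const std::string& path, const std::string& utf8Text)
{
    // Binary mode, so the runtime does not translate any bytes
    // (for example "\n" to "\r\n" on Windows).
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    static const char bom[] = { '\xEF', '\xBB', '\xBF' };
    out.write(bom, sizeof bom);
    out.write(utf8Text.data(), static_cast<std::streamsize>(utf8Text.size()));
    return out.good();
}
```

A call such as `writeUtf8WithBom("export.csv", "a;b;c\n")` should produce a file that BOM-aware editors label as UTF-8 with BOM, even though the payload itself is plain ASCII.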

Mark Ransom
  • FWIW, I don't know of an OS that does. Both Linux and macOS seem to rely on the logic used by `file`, `uchardet` or `enca`, which is probably similar to the way most applications figure it out. To put it in the descriptive terms of `enca`, the determination is: "a mixture of parsing, statistical analysis, guessing and black magic". – Heath Raftery Sep 14 '22 at 05:37

AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Passing the text as the format string is also risky, because any % in the data is interpreted as a conversion specification. Since you already converted the data yourself, use fwrite() instead so that you write the UTF-8 data as-is, e.g.:

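// Step 1: ANSI (CP_ACP) -> UTF-16. Passing the length excludes the terminating NUL.
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), static_cast<int>(line.length()), NULL, 0 );
if (dwCount == 0) continue; // empty line or conversion failure: nothing is written

std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), static_cast<int>(line.length()), &utf16Text[0], dwCount );

// Step 2: UTF-16 -> UTF-8.
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], static_cast<int>(utf16Text.size()), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;

std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], static_cast<int>(utf16Text.size()), &utf8Text[0], dwCount, NULL, NULL );

// Write the converted bytes unchanged; fwrite() does not interpret them.
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");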
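
To see what such a two-step conversion produces at the byte level without the Windows API, here is a small portable sketch. It is not the same thing as the CP_ACP conversion above: it assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point, and the name `latin1ToUtf8` is made up for this example.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Assumption: the input is Latin-1, NOT an arbitrary ANSI code page.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (std::string::size_type i = 0; i < latin1.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80)
        {
            // ASCII is copied unchanged.
            utf8.push_back(static_cast<char>(c));
        }
        else
        {
            // U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return utf8;
}
```

ASCII input comes out unchanged, which is why a pure-ASCII CSV looks identical before and after conversion. Only bytes at or above 0x80 grow to two bytes, for example 0xE9 (é) becomes C3 A9.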
Remy Lebeau

The type char knows nothing about encodings; all it can do is store 8 bits. A text file is therefore just a sequence of bytes, and the reader must guess the underlying encoding. A file starting with a UTF-8 BOM (the bytes EF BB BF) signals UTF-8, but using a BOM is not recommended any more. In contrast, the type wchar_t is always interpreted as UTF-16 on Windows.
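
A quick way to see that a std::string only holds bytes is to compare its size with the number of characters it encodes. The sketch below counts code points by skipping UTF-8 continuation bytes; the name `countUtf8CodePoints` is hypothetical, and the function assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumption: the input is valid UTF-8; no validation is done.
std::size_t countUtf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)
            ++count;   // a lead byte or an ASCII byte starts a new code point
    }
    return count;
}
```

For the two characters 孔子, stored as the six bytes E5 AD 94 E5 AD 90, size() reports 6 while this function reports 2. For pure ASCII text the two numbers are equal.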

So let's say you have a file encoded in UTF-8 with just one line: "Confucius says: Smile. 孔子说:微笑!" The following code snippet appends this text once more, then reads the first line and displays it with MessageBoxW and with MessageBoxA. Note that MessageBoxW shows the correct text, while MessageBoxA shows junk because it assumes my local code page 1252 for the char* string.

Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful: the code page argument (CP_UTF8 here) is optional, and if it is omitted the local code page is used.

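#include <fstream>
#include <string>
#include <atlbase.h>  // CA2W

int main()
{
  std::fstream afile;
  // UTF-8 bytes in a std::string. Before C++20 only: from C++20 on, u8"" yields char8_t.
  std::string line1A = u8"Confucius says: Smile. 孔子说:微笑! ";
  std::wstring line1W;

  // Append the line to the existing file.
  afile.open("Test.txt", std::ios::out | std::ios::app);
  if (!afile.is_open())
    return 1;

  afile << "\n" << line1A;
  afile.close();

  // Read the first line back as raw bytes.
  afile.open("Test.txt", std::ios::in);
  if (!afile.is_open())
    return 1;
  std::getline(afile, line1A);
  afile.close();

  // Interpret the bytes as UTF-8 and convert them to UTF-16.
  line1W = CA2W(line1A.c_str(), CP_UTF8);
  MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);  // shows the correct text
  MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);   // shows junk: bytes are read in the ANSI code page

  return 0;
}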