
I managed to export English text to a CSV file and to implement localization. Text in Latin-script languages (e.g. German) works fine, but my program cannot export Chinese/Korean words to the CSV; they show up as weird characters instead:

What the non-English text looks like

For reference, the English version looks like this:

The expected output

Here's the code I use to generate the file:

ofstream file(filename);
// file.imbue(locale(file.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>()));
file << outputListWaveform << "\n";

// This part writes the header of each column
for (int x = 0; x < data.size(); ++x)
{
    file << get<0>(data.at(x));
    if (x != data.size() - 1)
        file << ",";
}
file << "\n";

for (int i = 0; i < get<1>(data.at(0)).size(); ++i)
{
    for (int j = 0; j < data.size(); ++j)
    {
        // Bind by const reference to avoid copying the column vectors on every cell
        const auto& header = get<0>(data.at(j));
        const auto& dVal = get<1>(data.at(j));
        const auto& bVal = get<2>(data.at(j));

        file << ((header == FileOpConstants::BOST || header == FileOpConstants::EOST) ? bVal.at(i) : dVal.at(i));
        if (j != data.size() - 1)
            file << ",";
    }
    file << "\n";
}

file.close();

And here is the code to import the CSV file back:

    StepTable data;
    ifstream file(filename);

    if (!file.is_open())
    {
        string errMsg = "Could not open file: " + filename;
        throw runtime_error(errMsg);
    }

    string line, colname;
    if (file.good())
    {
        getline(file, line); // metadata (e.g. "Output List Waveform" as generated using Cyclops)
        getline(file, line); // column header
        stringstream ss(line);
        while (getline(ss, colname, ','))
        {
            data.push_back({colname, vector<double>{}, vector<bool>{}});
        }
    }

    // Get Column Values (row by row)
    while (getline(file, line))
    {
        int i = 0;
        auto val = Utility::split(line, ',');
        for (const auto& v : val)
        {
            auto header = get<0>(data.at(i));
            if (header == FileOpConstants::BOST || header == FileOpConstants::EOST)
            {
                get<2>(data.at(i)).push_back(stoi(v));
            }
            else if(header == VoltageHeader || header == CurrentHeader || header == PowerHeader || header == ResistanceHeader || header == TimeHeader )
            {
                get<1>(data.at(i)).push_back(stod(v));
            }
            else{
                return {};
            }
            i++;
        }
    }

    file.close();
    return data;

I tried this method ( https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0 ) on my exported CSV file and it looks like this:

[screenshot of the exported CSV after importing it as UTF-8]

I am able to export and import the Chinese CSV file without any error. But I would like the exported CSV file to show the correct Chinese words when opened, without any extra steps.

Newbie
  • From the image you gave us I think it has to do with some part of your code/the csv viewer not handling UTF-8 – Tzig Sep 03 '21 at 08:28
  • How did you import the CSV file into Excel? You need to specify UTF-8 when importing; [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0) is a sample link I found. If it still does not work, please show a hex dump of the bytes that make up the first row of your CSV so we can figure out what the encoding is. – Botje Sep 03 '21 at 11:17
  • The whole problem is caused by incorrect handling of character encodings. Some time ago [I cooked up this answer](https://stackoverflow.com/a/67819605/1387438), which explains how to do it for MSVC (for gcc/clang it should be similar or even easier). – Marek R Sep 03 '21 at 14:00
  • @MarekR I saw that you set the active code page with "chcp 65001". How do I do that in the C++ file? – Newbie Sep 04 '21 at 07:37
  • @Newbie I'm using `chcp xxxx` just to show how the `cmd` settings affect the application and the observed output. Note what happens when `chcp 1250` is used: `encodings.exe .65001 .1250`. Since this is the Polish code page, those characters are printed correctly and Western European characters are converted to the closest equivalent. Chinese and Korean characters are shown as question marks, since they cannot be represented in this encoding. – Marek R Sep 05 '21 at 17:25

1 Answer


Microsoft products are notorious for using a BOM (byte order mark) in UTF-8. The Unicode standard permits a BOM in UTF-8 but neither requires nor recommends it.

Excel uses the BOM to determine the encoding of a CSV when you open it directly (e.g. by double-clicking). If there is no BOM, it falls back to a locale-dependent 8-bit encoding (probably cp1252).

To force Excel to read the file as UTF-8, write a BOM as the very first bytes of the file, like this:

ofstream file(filename);
file << char(0xEF) << char(0xBB) << char(0xBF);
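
The two lines above can be wrapped in a small helper so that the BOM is always written before anything else. This is only a sketch: `openUtf8CsvWithBom` is a hypothetical name, and it assumes that the strings written afterwards (headers, metadata) are already UTF-8 encoded, which the question does not state explicitly.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of the question's code): opens `path` for
// writing and emits the three-byte UTF-8 BOM (EF BB BF) before any other data.
// Assumption: everything written to the stream afterwards is UTF-8 encoded.
std::ofstream openUtf8CsvWithBom(const std::string& path)
{
    std::ofstream file(path);
    if (file)
        file << "\xEF\xBB\xBF"; // UTF-8 byte order mark
    return file;                // streams are movable since C++11
}
```

In the export code from the question, `ofstream file(filename);` could then become `ofstream file = openUtf8CsvWithBom(filename);`, with the rest of the writing code left unchanged.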

You will have to deal with the BOM when reading your files back: its three bytes will appear at the start of the first line.
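
One possible way to handle the BOM on the reading side is to strip it from the first line after `getline`. The helper below is a minimal sketch; `stripUtf8Bom` is a hypothetical name, not something from the question or from a library.

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) from `line`
// if one is present. Returns true when a BOM was found and removed.
bool stripUtf8Bom(std::string& line)
{
    static const std::string bom = "\xEF\xBB\xBF";
    // compare() only looks at the first bom.size() characters of `line`
    // (or fewer if the line is shorter), so short and empty lines are safe.
    if (line.compare(0, bom.size(), bom) == 0)
    {
        line.erase(0, bom.size());
        return true;
    }
    return false;
}
```

In the import code from the question, the call would go right after the first `getline(file, line);`. That first line (the metadata line) is read and then discarded there, so the BOM is thrown away with it and the header comparison on the second line is unaffected; stripping only matters once the content of the first line is actually used.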

king_nak