
I have a JSON file with the following content (for example):

{
    "excel_filepath": "excel_file.xlsx",
    "line_length": 5.0,
    "record_frequency": 2.5,
    "report_file_name": "\u041f\u0421 \u041f\u0440\u043e\u043c\u0437\u043e\u043d\u0430 - \u041f\u0421 \u041f\u043e\u0433\u043e\u0440\u0435\u043b\u043e\u0432\u043e (\u0426.1)",
    "line_type": 1
}

This JSON file is generated by a Python script.

For reading the JSON file, I use the <nlohmann/json.hpp> library (I found it simple for my case):

using json = nlohmann::json;

std::ifstream f("temp_data.json");
json data = json::parse(f);

What I want to do is read the "report_file_name" value and create a simple .txt file named after it. As you can see, the value is stored as Unicode escape sequences.

What I am trying to do is as follows:

_setmode(_fileno(stdout), _O_U16TEXT);
const locale utf8_locale = locale(locale(), new codecvt_utf8<wchar_t>());

string report_file_name = data["report_file_name"];
    
for (auto unicode_char : report_file_name) 
{
    wcout << typeid(unicode_char).name() << ": " << unicode_char << endl;
}

wofstream report_file(report_file_name + L".txt");
report_file.imbue(utf8_locale);

This produces the following output:

char: Ð  
char:  
char: Ð  
char: ¡  
char:  
char: Ð  
char:  
char: Ñ  
char:  
char: Ð  
char: ¾
... and so on

I should note that I did somehow manage to write Cyrillic letters into a report file. Interestingly, when I do:

wcout << L"\u041f\u0421" << endl;

It prints out Cyrillic letters (ПС) correctly. Also, no problem with creating the report .txt file with a Cyrillic name from code:

wofstream report_file(L"Отчет.txt"); // fine!

Am I doing something wrong? I'm using Windows 10 and Visual Studio 2022 with the C++17 standard, if that helps.

Remy Lebeau
  • nlohmann::json has decoded the `\uXXXX` escapes into unicode characters and then encoded everything to UTF-8 so it can fit in a `std::string`. If you want to construct a UTF-16 filename you need to convert from UTF-8 first. I don't even know which overload of `std::string` your `report_file_name + L".txt"` expression reaches... – Botje Mar 15 '23 at 14:36
  • It is my bad, actually. I have no solid reason to create .txt file with "report_file_name + L".txt"". I guess it is possible to create just by report_file_name + ".txt". As far as I got you. – Yerlan Amir Mar 15 '23 at 14:51
  • That very much depends on what your OS does when presented a UTF-8 filename in a call to `open`. See [this answer](https://stackoverflow.com/a/26803325/1548468). – Botje Mar 15 '23 at 14:56

1 Answer


Per nlohmann::json's documentation:

https://github.com/nlohmann/json#character-encoding

Character encoding

The library supports Unicode input as follows:

  • Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259.
  • std::u16string and std::u32string can be parsed, assuming UTF-16 and UTF-32 encoding, respectively. These encodings are not supported when reading from files or other input containers.
  • Other encodings such as Latin-1 or ISO 8859-1 are not supported and will yield parse or serialization errors.
  • Unicode noncharacters will not be replaced by the library.
  • Invalid surrogates (e.g., incomplete pairs such as \uDEAD) will yield parse errors.
  • The strings stored in the library are UTF-8 encoded. When using the default string type (std::string), note that its length/size functions return the number of stored bytes rather than the number of characters or glyphs.
  • When you store strings with different encodings in the library, calling dump() may throw an exception unless json::error_handler_t::replace or json::error_handler_t::ignore are used as error handlers.
  • To store wide strings (e.g., std::wstring), you need to convert them to a UTF-8 encoded std::string before, see an example.
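
The point about bytes versus characters likely explains the output in the question: each Cyrillic letter occupies two bytes in UTF-8, so iterating over a std::string one char at a time prints half-characters. Here is a minimal sketch of the difference, assuming valid UTF-8 input. The helper name count_code_points and the sample variable are hypothetical and not part of nlohmann::json.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of nlohmann::json): counts Unicode code points
// in a UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}

// "ПС" (U+041F U+0421) written out as its four UTF-8 bytes.
const std::string sample = "\xD0\x9F\xD0\xA1";
// sample.size() is the byte count (4); count_code_points(sample) is 2.
```

The bytes 0xD0 and 0xA1 are the ones that show up as "Ð" and "¡" when each byte is printed as a separate Latin-1 character.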

So, in your case, report_file_name is a UTF-8 encoded std::string, which you will need to decode into a std::wstring (UTF-16 on Windows, UTF-32 on most other platforms) before you can use it with std::wofstream, e.g.:

std::wstring utf8_to_wstr(const std::string &utf8)
{
    // There are many questions on Stack Overflow about how to do this conversion.
    // You can use the Win32 MultiByteToWideChar() API, or std::wstring_convert
    // with std::codecvt_utf8/_utf16, or a 3rd party Unicode library such as
    // ICU or iconv...
}

...

std::wstring report_file_name = utf8_to_wstr(data["report_file_name"].get<std::string>());
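
For illustration, here is one way the body of utf8_to_wstr could be filled in. It is a hand-rolled, portable decoder used in place of the library routes named above, so treat it as a sketch under stated assumptions rather than a recommended implementation. It rejects bad lead bytes, truncated sequences, and bad continuation bytes, but it does not reject overlong encodings or encoded surrogates. In production code, MultiByteToWideChar or a Unicode library is probably the safer choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch only: decodes UTF-8 into std::wstring, emitting UTF-16 where wchar_t
// is 16 bits wide (e.g. Windows) and UTF-32 where it is 32 bits wide.
// Throws std::invalid_argument on malformed input. Overlong encodings and
// encoded surrogates are NOT rejected in this simplified version.
std::wstring utf8_to_wstr(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;        // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of Unicode range");
        }
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

With a function like this, std::wofstream report_file(report_file_name + L".txt") should work on MSVC, whose file streams accept wide filenames as an extension. In C++17, std::filesystem::u8path(name + ".txt"), where name is the UTF-8 std::string, is another option: it takes UTF-8 directly, and the resulting path can be passed to a stream constructor. Note that u8path is deprecated as of C++20.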
Remy Lebeau