I'm currently writing a C++ program that's rather math-involved. As such, I'm trying to denote some objects as having subscript numbers in a wstring member variable of their class. However, every attempt at storing these characters turns them into their non-subscript counterparts. By contrast, characters pasted directly into the code are output as desired. Here are several cases I experimented with:

setlocale(LC_ALL, "");
wchar_t txt = L'\u2080';
wcout << txt << endl;
myfile << txt << endl;

This outputs "0" to both the file and console.

setlocale(LC_ALL, "");
wcout << L"x₀₁" << endl;
myfile << L"x₀₁" << endl;

This outputs "x01" to both the file and console.

setlocale(LC_ALL, "");
wcout << "x₀₁" << endl;
myfile << "x₀₁" << endl;

This outputs "xâ'?â'?" to the console, which I'd like to avoid if possible, and "x₀₁" to the file, which is what I want. An ideal program state would be one that properly outputs to both the file and the console, but if that’s not possible, then printing non-subscript characters to the console is preferable.

My code intends to convert ints into their corresponding subscripts. How do I manipulate these characters as smoothly as possible without them getting converted back? I suspect that character encoding plays a part, but I do not know how to incorporate Unicode encoding into my program.

  • Character/string encoding and formatting problems are not easy to solve. Have fun. – Jesper Juhl Feb 04 '20 at 19:59
  • It's risky to place non-ASCII characters within double quotes in C++ source code. What you may be able to conjure up on your keyboard and stick between double quotes may not be what you get at runtime. – PaulMcKenzie Feb 04 '20 at 20:29
  • If using VStudio, Unicode is an option for the `Character Set` field in your Project Settings. Another possibility is whatever terminal you use does not support unicode, which would make it impossible for you to get the desired result. Possibly this question is a [duplicate](https://stackoverflow.com/questions/12015571/how-to-print-unicode-character-in-c) – alteredinstance Feb 04 '20 at 20:33
  • On my system (Linux, compiling with clang9) it works, as it does on [godbolt](https://godbolt.org/z/gpKfzb). Note that it does not look great on godbolt but that is just the subscript 0 as it looks there (I added a superscript 1, that looks more natural). So I think it depends very heavily on your system. – n314159 Feb 04 '20 at 20:49
  • @alteredinstance I'm using VS Code. I did a quick search through settings and found one for file encoding, but the only unicode encodings were utf8 (selected), utf8bom, utf16le, and utf16be. I believe the characters I'm interested in are in utf32. – Michael Luger Feb 05 '20 at 06:05
  • Try using UTF16BE (Big Endian). It appears the subscript characters are UTF-16, and not 32. My [reference](https://www.fileformat.info/info/unicode/char/2080/index.htm) – alteredinstance Feb 05 '20 at 15:07
  • Turns out the setting in question relates to reading source files, and so doesn't seem to be of much help. – Michael Luger Feb 05 '20 at 19:55
  • Just remember that it's not just the correct encoding, but also whether the font actually contains the required glyph. The old Windows console by default doesn't support a lot of them, although I think newer Windows versions switched to a better font. – Voo Feb 06 '20 at 21:15

2 Answers

I find these things tricky and I'm never sure if it works for everyone on every Windows version and locale, but this does the trick for me:

#include <Windows.h>
#include <io.h>     // _setmode
#include <fcntl.h>  // _O_U16TEXT

#include <clocale>  // std::setlocale
#include <cstdio>   // _fileno, stdout
#include <iostream>

// Unicode UTF-16, little endian byte order (BMP of ISO 10646)
constexpr char CP_UTF_16LE[] = ".1200";

constexpr wchar_t superscript(int v) {
    constexpr wchar_t offset = 0x2070;       // superscript zero as offset
    if (v == 1) return 0x00B9;               // special case
    if (v == 2 || v == 3) return 0x00B0 + v; // special case 2
    return offset + v;
}

constexpr wchar_t subscript(int v) {
    constexpr wchar_t offset = 0x2080; // subscript zero as offset
    return offset + v;
}

int main() {
    // set these before doing any other output:
    std::setlocale(LC_ALL, CP_UTF_16LE);
    _setmode(_fileno(stdout), _O_U16TEXT);

    // subscript
    for (int i = 0; i < 10; ++i)
        std::wcout << L'X' << subscript(i) << L' ';
    std::wcout << L'\n';

    // superscript
    for (int i = 0; i < 10; ++i)
        std::wcout << L'X' << superscript(i) << L' ';
    std::wcout << L'\n';    
}

Output:

X₀ X₁ X₂ X₃ X₄ X₅ X₆ X₇ X₈ X₉
X⁰ X¹ X² X³ X⁴ X⁵ X⁶ X⁷ X⁸ X⁹

A more convenient way may be to create wstrings directly. Here wsup and wsub each take a wstring and return a converted wstring. Characters they can't handle are left unchanged.

#include <Windows.h>
#include <io.h>      // _setmode
#include <fcntl.h>   // _O_U16TEXT

#include <algorithm> // std::transform
#include <clocale>   // std::setlocale
#include <cstdio>    // _fileno, stdout
#include <iostream>
#include <string>    // std::wstring

// Unicode UTF-16, little endian byte order (BMP of ISO 10646)
constexpr char CP_UTF_16LE[] = ".1200";

std::wstring wsup(const std::wstring& in) {
    std::wstring rv = in;

    std::transform(rv.begin(), rv.end(), rv.begin(),
        [](wchar_t ch) -> wchar_t {
            // 1, 2 and 3 can be put in any order you like
            // as long as you keep them in the top section
            if (ch == L'1') return 0x00B9;
            if (ch == L'2') return 0x00B2;
            if (ch == L'3') return 0x00B3;

            // ...but this must be here in the middle:
            if (ch >= '0' && ch <= '9') return 0x2070 + (ch - L'0');

            // put the below in any order you like,
            // in the bottom section
            if (ch == L'i') return 0x2071;
            if (ch == L'+') return 0x207A;
            if (ch == L'-') return 0x207B;
            if (ch == L'=') return 0x207C;
            if (ch == L'(') return 0x207D;
            if (ch == L')') return 0x207E;
            if (ch == L'n') return 0x207F;

            return ch; // no change
        });
    return rv;
}

std::wstring wsub(const std::wstring& in) {
    std::wstring rv = in;

    std::transform(rv.begin(), rv.end(), rv.begin(),
        [](wchar_t ch) -> wchar_t {
            if (ch >= '0' && ch <= '9') return 0x2080 + (ch - L'0');
            if (ch == L'+') return 0x208A;
            if (ch == L'-') return 0x208B;
            if (ch == L'=') return 0x208C;
            if (ch == L'(') return 0x208D;
            if (ch == L')') return 0x208E;
            if (ch == L'a') return 0x2090;
            if (ch == L'e') return 0x2091;
            if (ch == L'o') return 0x2092;
            if (ch == L'x') return 0x2093;
            if (ch == 0x0259) return 0x2094; // small letter schwa: ə
            if (ch == L'h') return 0x2095;
            if (ch >= 'k' && ch <= 'n') return 0x2096 + (ch - 'k');
            if (ch == L'p') return 0x209A;
            if (ch == L's') return 0x209B;
            if (ch == L't') return 0x209C;

            return ch; // no change
        });
    return rv;
}

int main() {
    std::setlocale(LC_ALL, CP_UTF_16LE);
    if (_setmode(_fileno(stdout), _O_U16TEXT) == -1) return 1;

    auto pstr = wsup(L"0123456789 +-=() ni");
    auto bstr = wsub(L"0123456789 +-=() aeoxə hklmnpst");

    std::wcout << L"superscript:   " << pstr << L'\n';
    std::wcout << L"subscript:     " << bstr << L'\n';

    std::wcout << L"an expression: x" << wsup(L"(n-1)") << L'\n';
}

Output:

superscript:   ⁰¹²³⁴⁵⁶⁷⁸⁹ ⁺⁻⁼⁽⁾ ⁿⁱ
subscript:     ₀₁₂₃₄₅₆₇₈₉ ₊₋₌₍₎ ₐₑₒₓₔ ₕₖₗₘₙₚₛₜ
an expression: x⁽ⁿ⁻¹⁾

My console didn't manage to display the subscript versions of hklmnpst - but apparently the transformation was correct because it shows up here ok after copy/pasting.

Ted Lyngmo
  • With modifications, this worked excellently! For anyone who stumbles across this after the fact, my C++ standard required me to rewrite the constexpr functions as one-line return statements. I also had to declare wmain as extern "C" and add the "-municode" flag to my compiler build. Thanks! – Michael Luger Feb 06 '20 at 17:10
  • @MichaelLuger You're welcome! You could just remove the `constexpr` part from the functions. I just thought it would be good to have them `constexpr` if one wants to extend it with some compile time generation of "good stuff" that's used a lot. - It should be ok to rename `wmain` to `main` too. I can't verify that now, but I actually left `wmain` there by mistake :-) – Ted Lyngmo Feb 06 '20 at 17:14
  • I just checked, and `wmain()` or `main()` doesn't matter, as I thought. Also, note that there's a bugfix. My old `superscript` returned a superscripted `i` instead of a `1` - but my bad eyes didn't catch it. :) – Ted Lyngmo Feb 06 '20 at 18:02
  • Yes, I've edited it to fix that bug. I did, however, notice that the code fails to output to files. Characters I could type out print as normal, but printing any output derived from the superscript or subscript functions produces unexpected results: `superscript(1)` produces a 1 in subscript. `superscript(2)` prints the intended character. `superscript(3)` produces `ł`. I'm unsure what precisely is happening. `subscript(2)` leaves the file empty, without printing any other text. – Michael Luger Feb 06 '20 at 18:18
  • That's odd. I'll play around with that too. Are you using a `wofstream`? – Ted Lyngmo Feb 06 '20 at 18:23
  • I am using a `wofstream`, yes. – Michael Luger Feb 06 '20 at 18:34
  • I think you'll have to use `_wfopen()` to be able to use `_setmode` on the stream. I haven't found a way to use `_setmode` on `C++` streams. – Ted Lyngmo Feb 06 '20 at 19:05
  • I've stumbled across a way to use `_setmode` with `_wfopen`, but from what I've gathered, this has locked me into C-style wide character and string manipulation. I suppose this is manageable, at the cost of all strings needing to be converted to `const wchar_t*` before being printed. – Michael Luger Feb 06 '20 at 20:18
  • @MichaelLuger Yes, but if you've got that working I'll update the answer with a more convenient conversion routine that creates `wstrings` that you can do `.c_str()` on to get the `const wchar_t*`. – Ted Lyngmo Feb 06 '20 at 20:29

You should configure both the console and the program you use to open the file so that they interpret your text in the encoding you wrote it in (e.g. UTF-32).

For example, on Windows you can set your console code page with the SetConsoleOutputCP function. To view a file in a different encoding, you can add the file to a Visual Studio solution, right-click it, choose Open With / Source Code (Text) Editor With Encoding, and then select your encoding.

idris