A part of the project includes something similar to a scrolling 'stock ticker', where a larger string "scrolls across" a fixed width output string.

Using C++11 on Linux, the concept is clear for Latin characters. Something like this:

std::string inputString = "HELLO";
std::string outputString(mOutputTextWidth, ' '); // fixed-width window, initially blank
for (std::size_t inIdx = 0; inIdx < inputString.size(); inIdx++)
{
    // shift output one character left
    for (int i = 0; i < mOutputTextWidth - 1; i++)
        outputString[i] = outputString[i + 1];

    // write the next input character into the last slot
    outputString[mOutputTextWidth - 1] = inputString.at(inIdx);
    sleep(1);
}

You would get something like:

[           ]
[          H]
[         HE]
[        HEL]
[       HELL]
[      HELLO]
[     HELLO ]
[    HELLO  ]
[   HELLO   ]

I need to make this work for non-Latin UTF-8 text. From what I've read, this is a complex subject. In particular, std::string::at and operator[] return a single char (one byte), so they break on UTF-8 characters that are encoded as several bytes.

In C++ what's the right way of doing this?

E.g. Japanese:

[              ]
[            こ]
[          こん]
[        こんば]
[      こんばん]
[    こんばんは]
[  こんばんは  ]
[こんばんは    ]

(I know the glyph widths will vary by language, that's ok. I just can't figure out how to manipulate UTF-8 strings)

Danny
  • I recently posted an answer to a similar question [here](https://stackoverflow.com/questions/60975518/unable-to-work-with-utf8-character-in-c). It may be useful in understanding how UTF-8 is represented in memory. – rustyx Apr 12 '20 at 13:04
  • UTF-8 support in standard C++ is sketchy. The best course of action heavily depends on your platform and toolset. If you want portable code, you probably want to use a third party library. – n. m. could be an AI Apr 12 '20 at 13:18
  • Moreover, if you want minimally competent Unicode support, you have no choice but to use a third party library. C++ has no facilities to determine screen width of a string, or to inspect whether a given character is a regular, zero-width, double-width, or combining one. – n. m. could be an AI Apr 12 '20 at 13:27
  • n. 'pronouns' m: Do you have some suggestions for a third party library? – Danny Apr 12 '20 at 14:03

2 Answers

On systems that support Unicode natively (which includes Linux; see footnote 1), you can use the standard C++ multibyte support and convert the text to wchar_t, so that each element holds one Unicode code point.

For example:

#include <algorithm>
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::string inputUTF8 = "こんばんは!"; // assuming this source is stored in UTF-8

    std::setlocale(LC_ALL, "en_US.utf8"); // tell mbstowcs we want UTF-8->wchar_t conversion
    std::wcout.imbue(std::locale("en_US.utf8")); // tell std::wcout we want wchar_t->UTF-8 output

    std::vector<wchar_t> buf(inputUTF8.size() + 1); // reserve space
    int len = (int)std::mbstowcs(buf.data(), inputUTF8.c_str(), buf.size()); // convert to wchar_t
    if (len == -1) {
        std::cerr << "Invalid UTF-8 input\n"; // mbstowcs can fail
        return 1;
    }
    std::wstring out;
    for (int i = 0; i < len * 2; i++)
    {
        out.assign(std::max(0, len - i), L'\u3000'); // fill with ideographic space (U+3000) before

        // append the slice of the text that is currently inside the window
        out.append(buf.data(), std::max(0, i - len), std::min(len, i) - std::max(0, i - len));

        out.append(std::max(0, i - len), L'\u3000'); // fill with ideographic space (U+3000) after

        std::wcout << L"[" << out << L"]\n";
    }
}

Output:

[      ]
[     こ]
[    こん]
[   こんば]
[  こんばん]
[ こんばんは]
[こんばんは!]
[んばんは! ]
[ばんは!  ]
[んは!   ]
[は!    ]
[!     ]

Beware that std::setlocale changes process-wide state and mbstowcs depends on that state, so this approach needs care in multithreaded programs.

Another possibility is to use a library like iconv.


1 Unfortunately, Unicode support on Windows is limited: its wchar_t is 16 bits wide and actually holds UTF-16 code units, so the program as written works only for code points in the Basic Multilingual Plane (which still includes the typical CJK symbols, but not the supplementary CJK ideographs or other symbols above U+FFFF). That can still be fixed by taking UTF-16 surrogate pairs into account.

rustyx

After the numerous warnings against wchar_t, I implemented a solution based on the comment from rustyx that links to his earlier post. There might be holes in this approach, but so far it works for me when tested with English/Latin and Japanese input.

(I believe the code below works only with UTF-8, not sure about other legacy encodings like EUC-JP, SHIFT_JIS, etc)

Note that symbolLength() returns the number of code points in the string. That is not the same as the width on screen, because code points can have different display widths (or even zero width).

TqString::TqString(const std::string &s) { assign(s); }

TqString::TqString(const char *cs)
{
    std::string s(cs);
    assign(s);
}

TqString::TqString(size_t n, char c)
{
    std::string s(n, c);
    assign(s);
}

TqString &TqString::operator=(const std::string &s)
{
    assign(s);
    return *this;
}

// Unlike size(), this returns the number of UTF-8 code points
// in the input string

size_t
TqString::symbolLength() const
{
    size_t symCount = 0;
    int skipCount = 0; // continuation bytes still expected for the current symbol

    for (size_t i = 0; i < size(); i++)
    {
        unsigned char c = at(i);
        if (skipCount == 0)
        {
            if (c >= 0xF0)
                skipCount = 3;
            else if (c >= 0xE0)
                skipCount = 2;
            else if (c >= 0xC0)
                skipCount = 1;
        }
        else
        {
            --skipCount;
        }

        if (skipCount > 0)
            continue;

        symCount++;
    }
    return symCount;
}

// Scan the input string, skipping over 'n' symbols, and return the symbol
// at index 'n' (or the last symbol if 'n' is past the end)

std::string
TqString::symbolAt(off_t n) const
{
    std::string outString;
    int skipCount = 0;
    int symCount = 0;

    for (size_t i = 0; i < size(); i++)
    {
        unsigned char c = at(i);
        if (skipCount == 0)
        {
            outString = c;
            if (c >= 0xF0)
                skipCount = 3;
            else if (c >= 0xE0)
                skipCount = 2;
            else if (c >= 0xC0)
                skipCount = 1;
        }
        else
        {
            outString += c;
            --skipCount;
        }

        if (skipCount > 0)
            continue;


        if (symCount == n)
            break;

        symCount++;
    }

    return outString;
}

void
TqString::shiftLeft()
{
    std::string outString;
    if (size() == 0)
    {
        assign(outString);
        return;
    }

    const size_t count = symbolLength(); // computed once instead of on every iteration
    for (size_t i = 1; i < count; i++)
    {
        outString += symbolAt(i);
    }

    assign(outString);
}

// shift then append 's' to the end
void
TqString::shiftLeft(const TqString &s)
{
    shiftLeft();
    append(s);
}

std::string
TqString::str() const
{
    return std::string(data(), size()); // keep the full length even if a NUL byte is embedded
}
Danny