1

With English characters it is easy to extract a single char from a string. For example, the following code should output y:

string my_word = "my_word";  // any string whose second character is 'y'
cout << my_word.at(1);

If I try to do the same with Greek characters, I get a funny character:

string my_word = "λογος";
cout << my_word.at(1);

Output:

My question is: what can I do to make .at(), or a similar function, work here?

Many thanks!

3 Answers

2

std::string is a sequence of narrow characters (char). Many national alphabets need more than one char to encode a single letter in a UTF-8 locale, so s.at(0) returns only half of a letter, or even less. Use wide characters instead: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.

Also, set the right locale before printing anything, using the std::locale facilities.

Code example for this case:

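#include <iostream>
#include <locale>
#include <string>

int main(int, char**) {
    // The locale name is platform-dependent; std::locale throws
    // std::runtime_error if "en_US.utf8" is not available.
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());  // make wcout use the new global locale
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;  // first wide character of s
    return 0;
}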
user2807083
  • This might work in this case, but it does not solve the problem in general. There are characters that don't fit into `wchar_t`, too. – JojOatXGME Sep 30 '17 at 10:39
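
To illustrate the comment above: wchar_t is only 16 bits wide on some platforms (Windows, for instance), so code points beyond U+FFFF do not fit in one element there. One possible alternative, sketched here under the assumption of a C++11 compiler, is std::u32string, where every element holds one full code point. The names code_point_at and kLogos are made up for this example, and the text is written with \u escapes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the code point at position `index`.
// Each char32_t element holds exactly one Unicode code point, so .at()
// indexes code points directly.
char32_t code_point_at(const std::u32string& text, std::size_t index) {
    return text.at(index);  // throws std::out_of_range on a bad index
}

// "λογος" spelled with universal character names (illustrative constant).
const std::u32string kLogos = U"\u03BB\u03BF\u03B3\u03BF\u03C2";
```

Here code_point_at(kLogos, 1) yields U+03BF (ο). Keep in mind that one code point is still not always one user-perceived character, because combining marks are separate code points.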
1

The problem is complex. Non-Latin characters have to be encoded properly, and there are a couple of standards for that. The question is which encoding your system is using.

In UTF-8, one character is represented by one or more bytes: from 1 to 4, depending on the character. For example, λ is represented by two bytes (in hex): CE BB.
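
A small sketch for checking this yourself: it dumps the bytes of a string in hex. The function name to_hex_bytes is an assumption made for this example, and the string is passed as explicit byte escapes so the example does not depend on how the source file is encoded.

```cpp
#include <string>

// Hypothetical helper: formats every byte of `text` as two uppercase
// hex digits, separated by single spaces.
std::string to_hex_bytes(const std::string& text) {
    const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : text) {
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

Calling to_hex_bytes("\xCE\xBB") (the UTF-8 encoding of λ) returns "CE BB", while a plain ASCII letter such as "y" gives a single byte, "79".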

There are also legacy encodings that represent Greek letters as single bytes (ISO 8859-7, for example), but I don't know whether your system uses one of them.

Note that my_word.length() most probably returns 10, not 5.
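
The difference between the two numbers can be sketched as follows, assuming the string holds valid UTF-8. Every continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do not match it gives the number of code points. The names count_code_points and kLogosUtf8 are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (10xxxxxx). Assumes `text` is valid UTF-8.
std::size_t count_code_points(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) ++count;  // lead byte or ASCII byte
    }
    return count;
}

// "λογος" as explicit UTF-8 bytes (illustrative constant).
const std::string kLogosUtf8 = "\xCE\xBB\xCE\xBF\xCE\xB3\xCE\xBF\xCF\x82";
```

For this string, kLogosUtf8.length() is 10 while count_code_points(kLogosUtf8) is 5.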

Marek R
1

As others have said, it depends on your encoding. An at() function is problematic once you move to internationalisation because Hebrew has vowels written around the character, for example. Not all scripts consist of discrete sequences of glyphs.

Generally it's best to treat strings as atomic, unless you are writing the display / word manipulation code itself, when of course you need the individual glyphs. To read UTF-8, check out the code in Baby X (it's a windowing system that has to draw text to the screen).

Here's the link: https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c

Here's the UTF-8 code. It's quite a hunk of code, but fundamentally straightforward.

static const unsigned int offsetsFromUTF8[6] = 
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

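/* Forward declarations: bbx_isutf8z() calls these before they are defined. */
int bbx_utf8_skip(const char *utf8);
int bbx_utf8_getch(const char *utf8);

int bbx_isutf8z(const char *str)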
{
  int len = 0;
  int pos = 0;
  int nb;
  int i;
  int ch;

  while(str[len])
    len++;
  while(pos < len && *str)
  {
    nb = bbx_utf8_skip(str);
    if(nb < 1 || nb > 4)
      return 0;
    if(pos + nb > len)
      return 0;
    for(i=1;i<nb;i++)
      if( (str[i] & 0xC0) != 0x80 )
        return 0;
    ch = bbx_utf8_getch(str);
    if(ch < 0x80)
    {
      if(nb != 1)
        return 0;
    }
    else if(ch < 0x800)  /* two-byte sequences end at U+07FF */
    {
      if(nb != 2)
        return 0;
    }
    else if(ch < 0x10000)
    {
      if(nb != 3)
        return 0;
    }
    else if(ch < 0x110000)
    {
      if(nb != 4)
        return 0;
    }
    pos += nb;
    str += nb;    
  }

  return 1;
}

int bbx_utf8_skip(const char *utf8)
{
  return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb) 
    {
            /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
  char *dest = out;
  if (ch < 0x80) 
  {
     *dest++ = (char)ch;
  }
  else if (ch < 0x800) 
  {
    *dest++ = (ch>>6) | 0xC0;
    *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x10000) 
  {
     *dest++ = (ch>>12) | 0xE0;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x110000) 
  {
     *dest++ = (ch>>18) | 0xF0;
     *dest++ = ((ch>>12) & 0x3F) | 0x80;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else
    return 0;
  return dest - out;
}

int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
    {
        return 1;
    }
    else if (ch < 0x800)
    {
        return 2;
    }
    else if (ch < 0x10000)
    {
        return 3;
    }
    else if (ch < 0x110000)
    {
        return 4;
    }
    else
        return 0;
}

int bbx_utf8_Nchars(const char *utf8)
{
  int answer = 0;

  while(*utf8)
  {
    utf8 += bbx_utf8_skip(utf8);
    answer++;
  }

  return answer;
}
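
Tying this back to the original question, the same lead-byte idea can be wrapped into an at()-like function for std::string. This is only a minimal C++ sketch, not part of Baby X: utf8_at and sequence_length are names invented here, the input is assumed to be UTF-8, and continuation bytes are not validated beyond a length check.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence.
std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Hypothetical helper: returns the bytes of the code point at 0-based
// position `index`, walking the string one sequence at a time.
std::string utf8_at(const std::string& text, std::size_t index) {
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = sequence_length(static_cast<unsigned char>(text[pos]));
        if (len == 0 || pos + len > text.size())
            throw std::invalid_argument("utf8_at: malformed UTF-8");
        if (index == 0) return text.substr(pos, len);
        --index;
        pos += len;
    }
    throw std::out_of_range("utf8_at: index out of range");
}
```

With the UTF-8 bytes of "λογος" as input, utf8_at(s, 1) returns the two bytes CE BF (ο), which can be written to std::cout as they are. Lookup is linear in the length of the string, and it indexes code points, not glyphs, so the caveat about combining marks still applies.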
Malcolm McLean