2

I am a rookie with C++. I have a string "tỏa", but I can't get the character 'ỏ'. Why is the length of that string 5? How can I get that character as a variable?

#include <iostream>
#include <string>

void test() {
    std::string str ("tỏa");
    for(std::size_t i=0; i<str.length(); ++i){
        std::cout << str[i] << std::endl;
    }
}

And the output of that code is:

t
�
�
�
a

Can anyone help me? Thanks in advance.

anastaciu
  • 4
    You probably saved the file as UTF-8. In that case, the middle character is represented by several bytes, not just a single byte. – Afshin Feb 18 '20 at 09:35
  • @Afshin Yes, I saved it as UTF-8, but how can I get that character with something like std::string a = str[i]? Thank you. – Bui Ngoc Bao Feb 18 '20 at 09:55
  • You need to read this https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – n. m. could be an AI Feb 18 '20 at 10:02

4 Answers

4

Use a combination of setlocale() and wstring:

Link to live sample

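#include <clocale>
#include <cstddef>
#include <iostream>
#include <string>

void test() {
    std::wstring str = L"tỏa";
    // Each element is one wide character, so 'ỏ' is a single element
    for (std::size_t i = 0; i < str.length(); ++i) {
        std::wcout << str[i] << std::endl;
    }
    std::wcout << L"Size: " << str.size() << std::endl; // the size of the string is 3, as it should be
}

int main()
{
    // Use the user's environment locale so std::wcout can encode wide characters
    std::setlocale(LC_ALL, "");
    test();
    return 0;
}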

EDIT:

If you want to save the wide char in a variable it's as simple as:

wchar_t ch = str[1];

You can also use the character's Unicode code point (U+1ECF, decimal 7887):

wchar_t ch = 7887;

Note: This may not work with every compiler on every OS; 100% portability is not guaranteed.
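
If the goal is to turn a code point such as 7887 back into a normal std::string (for example, to use it as a lookup key), one option is to encode it as UTF-8 by hand. The following is only a minimal sketch: the function name to_utf8 is made up for this illustration, and it assumes the value passed in is a full Unicode code point (which a wchar_t holds for 'ỏ', but not for every character on platforms where wchar_t is 16 bits).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid code points.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Surrogate values are not characters on their own
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (this is the case for 'ỏ')
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, to_utf8(7887) yields the three bytes E1 BB 8F, which is "ỏ" in UTF-8 and can be printed with std::cout on a UTF-8 console. For anything beyond simple cases, a Unicode library is likely the safer choice.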

anastaciu
  • Thank you, but what if I want to get that character rather than print it? How can I do that? – Bui Ngoc Bao Feb 18 '20 at 09:53
  • 1
    @BuiNgocBao, that is what your function does. Maybe you should clarify your question. – anastaciu Feb 18 '20 at 09:54
  • Thanks for your help. I just want to get it as a variable so I can look it up in a json file, like "int find = char_dict[std::string(1, input_lines[0][1])];", where input_lines[0][1] is the character "ỏ". – Bui Ngoc Bao Feb 18 '20 at 09:59
  • @BuiNgocBao you already have the string in the right encoding. If you want to use one of the characters or a substring, just do it, it's there. – anastaciu Feb 18 '20 at 10:04
  • Can you explain more? I am a rookie in C++. For example, 7887 is the integer for "ỏ". How can I convert 7887 to "ỏ"? – Bui Ngoc Bao Feb 18 '20 at 10:09
  • @BuiNgocBao `wchar_t ch = 7887;` – anastaciu Feb 18 '20 at 10:12
  • This may or may not work depending on the OS, compiler, and terminal emulator in use. – n. m. could be an AI Feb 18 '20 at 10:14
  • @anastaciu You are so kind, sir. I tried your way, wchar_t ch = str[1], and when I printed it to my console with "std::cout << ch << std::endl;" it displayed 7887, not "ỏ". I want to save "ỏ" as a string to use as a mapping key. – Bui Ngoc Bao Feb 18 '20 at 10:15
  • @BuiNgocBao you must use `std::wcout` to print wide chars. – anastaciu Feb 18 '20 at 10:16
  • @anastaciu Yes, but I want to save "ỏ" as a string, not as the wide char "7887", sir. Do you have any ideas? – Bui Ngoc Bao Feb 18 '20 at 10:19
  • @n.'pronouns'm., I added a note to state that. – anastaciu Feb 18 '20 at 10:19
  • @BuiNgocBao You lost me. If you want to save wide chars, you use wide char containers; you can't save them with the right encoding in a normal string or char variable because they are too large to fit. – anastaciu Feb 18 '20 at 10:22
  • @anastaciu Ah, now I understand you. Anyway, if I have a json file like "{ "ỏ": 5}", how can I get the result 5 for "ỏ"? (ỏ in a string like "Tỏa") – Bui Ngoc Bao Feb 18 '20 at 10:28
  • @BuiNgocBao, This is a whole different question, you should ask a new question explaining exactly what you need, preferably with a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example), the right tags, including json, and, if possible, a sample of the json file. – anastaciu Feb 18 '20 at 10:33
  • @BuiNgocBao, take a look here https://stackoverflow.com/questions/35745413/how-to-use-decodestring-in-jsoncpp-to-decode-a-string-containing-unicode-charact see if it helps. – anastaciu Feb 18 '20 at 10:39
  • If you use a UTF-16 wchar_t, then you'll face problems with characters outside the BMP. It's not a general solution to the problem. – phuclv Feb 19 '20 at 10:22
  • @phuclv, thanks for your comment, yes that is correct, and that's only scratching the surface, `wchar_t` is a can of worms, but for this sample it seemed appropriate. – anastaciu Feb 19 '20 at 10:55
  • As I said, just use a library. For very simple cases you can read the UTF-8 encoding and work out how many bytes to grab, but in general it's far too complex to handle manually. It's a Unicode issue and isn't specific to C++. – phuclv Feb 20 '20 at 01:58
3

You probably saved the file as UTF-8. In that case, the middle character is represented by several bytes (three for "ỏ"), not a single byte, which is why the length is 5. If you print one char per line, the bytes of that character are split apart and you see strange characters.

If you remove std::endl, you will probably see your string, because the console then receives the bytes one after another and can decode them as UTF-8 (I think Linux consoles do that by default).

Note: To handle UTF-8, you may need to add the following to your code:

std::setlocale(LC_ALL, "en_US.UTF-8");
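
To get the middle character as its own std::string rather than as separate bytes, one approach is to read the first byte of each UTF-8 sequence, which tells you how many bytes belong to that character. The sketch below assumes the input is valid UTF-8; the names utf8_sequence_length and split_utf8 are made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence starting with `lead`.
// Assumes valid UTF-8; a stray continuation or invalid byte is treated as length 1.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (the case for "ỏ")
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Hypothetical helper: split a UTF-8 string into one std::string per code point.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        // Truncated input: take whatever bytes are left
        if (len > s.size() - i) len = s.size() - i;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

For the string in the question, split_utf8(str)[1] would be a 3-byte std::string holding "ỏ", which can be stored in a variable or used as a map key. Note that this splits by code point; text that uses combining marks can still spread one visible character over several code points, which is where a Unicode library helps.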
Afshin
2

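std::string stores a sequence of bytes (char), so it is not well suited to working with characters that take more than 1 byte, such as "ỏ" in your case.

The "5" is the length of your string in bytes. std::string can still store strings like yours, but each multi-byte character is spread across several elements, which makes such strings hard to handle.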

Try to use std::wstring.

You can read here about wide characters: https://en.wikipedia.org/wiki/Wide_character
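
To see the difference between the byte length and the number of characters while staying with std::string, you can count only the bytes that start a UTF-8 sequence. This is a small sketch that assumes the string holds valid UTF-8; the function name utf8_code_points is made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Continuation bytes have the form 10xxxxxx, so every other byte starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For "tỏa" saved as UTF-8, str.size() is 5 while utf8_code_points(str) is 3.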

Denis
2

The character ỏ is a part of Extended Ascii (see https://theasciicode.com.ar/extended-ascii-code/letter-o-circumflex-accent-ascii-code-226.html).

If your console isn't able to recognize UTF-8, such characters (2+ bytes) will be represented with multiple boxes.

You might want to use std::wstring (http://www.cplusplus.com/reference/string/wstring/) to solve this problem.

Zàkelis
  • Thank you, but what if I want to get that character rather than print it? How can I do that? – Bui Ngoc Bao Feb 18 '20 at 09:53
  • Puh-lease. There is no such thing as "extended ASCII". Don't trust every two bit site you find on the interwebs. – n. m. could be an AI Feb 18 '20 at 10:04
  • @n.'pronouns'm. "No such thing as extended ASCII" isn't really correct. It is more that it is a catch-all term for all the different encodings that people mapped into the high 128 codes of an 8-bit character. But that said, I agree that any website that claims that there exists one single answer for the characters in "extended ASCII" isn't worth looking at. I mean: In their own history section they write that the other name for what they list is "code page 437", as if that isn't a hint that more than one encoding exists. – Frodyne Feb 18 '20 at 11:51
  • @Frodyne it would only be half-bad if the symbol in question really was the "letter o with a circumflex" and/or belonged to any of those 8 bit "extended ASCII" codepages. It isn't, and it doesn't. This is one example of the term "extended ASCII" being misleading and harmful. – n. m. could be an AI Feb 18 '20 at 13:30