
I need to print the characters of a string one at a time in a for loop (I use VS Code). Words without accents work fine, but if I try a word with an accent, the program crashes (my program is in Spanish, which is why I need accents and characters like 'ñ á í...'). Here is an example.

//My simple code
#include <iostream>
#include <windows.h>

using namespace std;

int main(){
    SetConsoleOutputCP(65001);

    string word = "arbol"; 
/*
With this word prints : a - r - b - o - l
If I put "árbol", just prints: - 
*/

    for (int i = 0; i < 5; i++)
    {
        cout << word[i] << " - ";
        
    }
    return 0;
}

Also, if I don't put the " - " at the end of the cout, any word is printed correctly. I know this is a beginner's question, but I don't know how to make it work. Thanks for your help :)

  • Ugh, `::std::string::operator []` can not be used to access individual characters when using multibyte character set. And it is not a good idea in general to store multibyte text in `::std::string` or print it through `::std::cout`. – user7860670 Aug 08 '21 at 16:57
  • The question body says your program crashes, but the comment says it just doesn't print the correct text. Which is it? – interjay Aug 08 '21 at 17:04
  • Is `word.size() == 1` when you use `"árbol"`? That might mean that the string is being poorly interpreted as UTF-32 or UTF-16. – Bill Lynch Aug 08 '21 at 17:17
  • Does this answer your question? [How do I properly use std::string on UTF-8 in C++?](https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c) – TruthSeeker Aug 08 '21 at 17:55

1 Answer


Based on your choice of code page 65001, you're outputting UTF-8. UTF-8 can take multiple bytes (or char units) to encode a single code point, and breaking one up in the middle is almost certain to produce something invalid. The inventors of UTF-8 were very clever, though. First, any character in the ASCII range of 0 to 0x7f, which includes all the unaccented letters, is a single byte, so a single char is all you need. For any other byte, mask it with 0xc0 to look at the top 2 bits: if the result is 0xc0, the byte is the start of a new code point; if it's 0x80, the byte is a continuation of the previous one.

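for (size_t i = 0; i < word.length(); )
{
    // Print the first byte of the next code point (lead byte or plain ASCII).
    cout << word[i++];
    // Then print any continuation bytes (bit pattern 10xxxxxx) that follow it.
    while ((i < word.length()) && ((word[i] & 0xc0) == 0x80))
    {
        cout << word[i++];
    }
    cout << " - ";
}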

This doesn't get you completely out of the woods, because Unicode gives multiple ways of creating a character. Your á can be either a single precomposed code point, U+00e1, or two code points: the normal a, U+0061, followed by the combining acute accent, U+0301. In the second situation the code presented above will separate the accent from its letter, and you'll almost certainly need the help of a Unicode library, which is beyond the scope of a Stack Overflow answer.
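To make the difference between the two forms concrete, here is a minimal sketch that applies the same continuation-byte test but collects each code point into its own string. The helper name `split_code_points` is hypothetical (it is not part of the code above), and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration: splits a UTF-8 string into one
// std::string per code point. Assumes the input is valid UTF-8.
std::vector<std::string> split_code_points(const std::string& text)
{
    std::vector<std::string> parts;
    for (std::size_t i = 0; i < text.size(); )
    {
        // The byte at 'start' is either plain ASCII or a lead byte.
        std::size_t start = i++;
        // Skip over continuation bytes, which have the bit pattern 10xxxxxx.
        while (i < text.size() &&
               (static_cast<unsigned char>(text[i]) & 0xc0) == 0x80)
        {
            ++i;
        }
        parts.push_back(text.substr(start, i - start));
    }
    return parts;
}
```

Called with the precomposed spelling `"\xc3\xa1" "rbol"` (U+00e1 encoded as the bytes C3 A1), this should return five pieces, the first of which is the whole á. Called with the decomposed spelling `"a\xcc\x81" "rbol"` (U+0061 followed by U+0301, encoded as CC 81), it should return six pieces, with the combining accent as its own second piece, detached from the a. Joining such pieces back into what a reader sees as one character is the job that needs a Unicode library.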

Mark Ransom