C++ String object wrong characters

Question

I have the following code:

#include <iostream>
#include <string>

using namespace std;

int main(){
  string letra = "méxico";

  for(size_t i=0;i<letra.size();i++){
    cout << letra[i] << endl;
  }

  return 0;
}

What I get as a result:

m
�
�
x
i
c
o

Why are there 7 characters instead of 6? If I do this:

cout << letra << endl;

I get:

méxico

What's going on? I've tried using

setlocale(LC_ALL,"es_MX.UTF-8");
setlocale(LC_ALL,"");

Although the function does not return NULL, it still does not work. I use Code::Blocks 16.01 with gcc 4.9 and g++ 4.9 on Linux.

You have a typo, `i==0` should be `i=0`. Also you should use [`wstring` instead of `string`](https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) along with [`wcout` instead of `cout`](https://stackoverflow.com/questions/8788057/how-to-initialize-and-print-a-stdwstring) – Cory Kramer, May 01 '17 at 19:17
Required reading for Unicode: https://en.wikipedia.org/wiki/Precomposed_character – Mooing Duck, May 01 '17 at 19:18
@CoryKramer That depends. Actually, the top answer to the question you linked advises *against* using `wstring` on Linux. The important point is that some characters can take up more than one `char` in a `std::string`. Printing the complete string should work fine on Linux. – Baum mit Augen, May 01 '17 at 19:21
@Javier Ramírez: It works fine with gcc on Windows using `setlocale(LC_ALL, "");`. Also note that you are using `==` instead of `=` inside the `for` loop. – Shadi, May 01 '17 at 19:31
@JavierRamirez -- I suggest *not* putting quoted strings that contain non-ASCII characters in your source code. The reason is that you don't know what the source editor will do with those characters (the accented `e`) when the file is saved. You need to be darn sure what the editor will translate those characters to when the source file is saved. If the file saves that string as something you didn't expect, all of the locale functions in the world are not going to help you, since it will be too late. – PaulMcKenzie, May 01 '17 at 19:41
I had already seen this: stackoverflow.com/questions/31357380/c-non-ascii-letters. And no, it does not work. Please, before marking this as a duplicate, first check whether that solution worked for me. – Javier Ramírez, May 01 '17 at 19:54
@JavierRamírez -- You don't know, or have not told us, exactly what the final string in the executable really is. Like I said, you cannot guarantee that a quoted string in your source code that uses non-ASCII characters will work as you expect. Show us the actual string of bytes in your program that represents that string -- if the `e` has been mangled in any way, well, you were warned. – PaulMcKenzie, May 01 '17 at 19:58

JoaoBapt · Answer 1 · 2017-05-02T10:41:36.963


std::string doesn't recognize encodings; its operator[] returns its individual bytes, not individual characters.

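In UTF-8, é (U+00E9) is encoded as two bytes, 0xC3 0xA9, and letra[i] gives you each of those bytes individually. Neither byte is a valid character on its own, so the terminal shows � for each. Printing the whole string with operator<< works because it writes the bytes unchanged and back to back, and the terminal then decodes the two-byte sequence as é.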

To step through the characters one at a time (sequentially, not by random access), you can use mbtowc, declared in <cstdlib>:

int mbtowc(wchar_t* pwc, const char* s, size_t n);

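It examines at most n bytes of s to find one character, decoded according to the current C locale, and stores it at pwc. It returns the number of bytes consumed, 0 if the character is the null character, or -1 if the bytes do not form a valid character. Your printing routine becomes something like this: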

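setlocale(LC_ALL, "");        // from <clocale>: mbtowc decodes using the current C locale
mbtowc(nullptr, nullptr, 0);  // reset the conversion state
for (size_t i = 0; i < letra.size();)
{
    wchar_t wc;
    int r = mbtowc(&wc, &letra[i], letra.size() - i);
    if (r <= 0) break;        // null character or invalid sequence: stop before printing
    wcout << wc << endl;      // wcout, because cout would print wc as a number
    i += r;
}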

edited May 02 '17 at 10:41

answered May 01 '17 at 19:41


But then, how do I get each of the characters? – Javier Ramírez May 01 '17 at 19:46
@JavierRamírez I edited my answer to add a way. However, it'd be best if you standardized on UTF-8 (just prefix your string literals with `u8`) and used a proper UTF-8 library like http://utfcpp.sourceforge.net. – JoaoBapt May 01 '17 at 20:06
Thanks for your time, friend! – Javier Ramírez May 01 '17 at 20:34
