
How do I loop through the letters of a string when it contains non-ASCII characters? This works on Windows:

for (std::size_t i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}

But on Linux, if I do:

std::string text = "á";
std::cout << text.length() << std::endl;

It tells me the string "á" has a length of 2, while on Windows it's only 1. With ASCII letters it works fine!

kovacsmarcell
  • Read about Unicode and the UTF-8 and UTF-16 encodings. – πάντα ῥεῖ Jul 11 '15 at 12:53
  • I assume it's because a string consists of the letter and '\0', hence a length of two. – arc_lupus Jul 11 '15 at 12:54
  • Perhaps use `std::wstring`? – Ed Heal Jul 11 '15 at 12:54
  • @arc_lupus, `std::string` doesn't count the nul terminator in `length()`. – chris Jul 11 '15 at 12:54
  • [Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). On Windows, if you switch to a code page without the character `á`, you'll see strange things happen. – phuclv Jul 11 '15 at 13:26

2 Answers


In your Windows system's code page, á is a single-byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.

On Linux, á is represented as the multibyte (2 bytes, to be exact) UTF-8 sequence C3 A1. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This never happens with ASCII characters because the UTF-8 representation of every ASCII character fits in a single byte.
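To see those two bytes for yourself, here is a minimal sketch that dumps the raw bytes of a `std::string` as hex. The helper name `to_hex_bytes` is made up for this illustration, and the literal is written with explicit `\x` escapes so the result does not depend on how the source file is encoded.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not a standard function): returns the bytes of `s`
// as space-separated uppercase hex, so each char of the string is visible.
std::string to_hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        // Format one byte as two hex digits.
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `to_hex_bytes("\xC3\xA1")` (the UTF-8 bytes of á) gives `C3 A1`, and `length()` on that same string reports 2, which matches what the question observed on Linux.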

Unfortunately, UTF-8 is not really supported by the C++ standard facilities. As long as you only handle the string as a whole, and neither access individual chars in it nor assume that its length equals the number of actual characters, std::string will most likely do fine.
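If you only need to walk a UTF-8 string one code point at a time, it can be done by hand, because continuation bytes always have the bit pattern 10xxxxxx. The following is a rough sketch under two assumptions: the input is valid UTF-8 (malformed input is not detected), and a "letter" means one code point (a letter built from a base character plus a combining mark would still come out as two pieces). The function name `split_code_points` is made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into one substring per code point.
// Assumes `s` is valid UTF-8; no validation is performed.
std::vector<std::string> split_code_points(const std::string& s)
{
    std::vector<std::string> out;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; anything else starts a new code point.
        const bool is_continuation = (byte & 0xC0) == 0x80;
        if (is_continuation && !out.empty())
            out.back() += static_cast<char>(byte);
        else
            out.push_back(std::string(1, static_cast<char>(byte)));
    }
    return out;
}
```

Looping over the returned vector and printing each element then prints whole letters, and the vector's `size()` is the number of code points rather than the number of bytes. For anything beyond this (validation, normalization, grapheme clusters), a dedicated library is the safer route.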

If you need more UTF-8 support, look for a good library that implements what you need.

You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.

Also have a look at this for information on how to handle different character encodings portably.

Baum mit Augen
  • Windows doesn't always use Windows-1252 – phuclv Jul 11 '15 at 13:27
  • @LưuVĩnhPhúc Here it apparently did (or it used another encoding where á is a single byte character, which was the point of the question and answer). But you are right, this is a little bit imprecise. Sadly, I do not know too much about how Windows handles such characters when it doesn't use Windows-1252, if you know how to formulate this better, feel free to edit or answer yourself. – Baum mit Augen Jul 11 '15 at 13:30
  • @LưuVĩnhPhúc Tried to make it better, is this correct as it stands? – Baum mit Augen Jul 11 '15 at 13:35
  • IMHO "In your Windows system's codepage, á is a single byte character" is more correct – phuclv Jul 11 '15 at 13:37
  • @celticminstrel That does indeed seem to provide at least some facilities. If it does everything you need, you can certainly use it. But as you can see in the table [here](http://en.cppreference.com/w/cpp/locale/codecvt), it has its limitations. – Baum mit Augen Jul 11 '15 at 13:57

Try using std::wstring. The encoding it uses isn't specified by the standard as far as I know, so I wouldn't save its contents to a file without a library that handles a specific format. It holds wide characters, so you can use letters and symbols that are not part of ASCII.

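#include <cstddef>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Use the user's locale so std::wcout can convert wide characters on output;
    // without this, non-ASCII output may fail on some systems (e.g. Linux).
    // Note: std::locale("") throws if the environment's locale is not available.
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    std::wstring text = L"áéíóú";

    // Print the string one wide character at a time.
    for (std::size_t i = 0; i < text.length(); i++)
        std::wcout << text[i];

    // Print the number of wide characters in the string.
    std::wcout << text.length() << std::endl;
}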
Avilius
  • Questionable advice for Linux systems since utf-8 is based on a series of single byte characters that get interpreted in a special way. Unfortunately, `wchar_t` is still a fixed width character type, which still does not match the variable length encoding utf-8 uses, at least not better than `char` (and thus `std::string`) does. You might want to read [this](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring). – Baum mit Augen Jul 11 '15 at 13:24
  • Thanks! I recommended it because it is a solution that would work on multiple platforms in a consistent manner. If you are solely using Linux, I would definitely recommend `std::string`. – Avilius Jul 11 '15 at 13:30
  • Is std::u16string worth using? Is it portable? – kovacsmarcell Jul 11 '15 at 13:37
  • @kovacsmarcell I would not fiddle with utf16 if I was you. http://stackoverflow.com/questions/16208079/how-to-work-with-utf-8-in-c-conversion-from-other-encodings-to-utf-8 http://utf8everywhere.org/ – Baum mit Augen Jul 11 '15 at 13:51
  • It's still a variable length encoding, as well. – celticminstrel Jul 11 '15 at 13:51
  • Is std::wstring portable? (I don't want to save it to a file.) – kovacsmarcell Jul 11 '15 at 15:23