
I'm trying to read and write Farsi characters in C++ and display them in CMD. The first thing I fixed was the font: I added the Farsi characters to it, and now I can write them to the screen, for example ب (U+0628), with this code:

#include <iostream>
#include <io.h>      // _setmode, _fileno
#include <fcntl.h>   // _O_U16TEXT
#include <cstdlib>   // system
using namespace std;
int main() {
    _setmode(_fileno(stdout), _O_U16TEXT); // switch stdout to UTF-16 mode
    wcout << L"\u0628 \n";
    wcout << L"ب" << endl;
    system("pause");
}

But how can I keep (store) this character? For Latin characters we can use char or string, but what about Farsi characters in UTF-8?

And how can I read them in? For Latin characters we use cin >> or gets_s.

Should I use wchar_t? If so, how? Because with this code it shows the wrong character:

wchar_t a='\u0628';
wcout <<a;

Also, I can't show the character بـ (U+FE91) even though it exists in my installed font, while ب (U+0628) is shown correctly.

Thanks in advance.

  • You need to read much more about [UTF-8](https://en.wikipedia.org/wiki/UTF-8) & [Unicode](https://en.wikipedia.org/wiki/Unicode). Notably http://utf8everywhere.org/ Hint: a UTF-8 character can span several bytes. – Basile Starynkevitch May 26 '17 at 11:35
  • And you can find many libraries to parse UTF8. [libunistring](https://www.gnu.org/software/libunistring/manual/libunistring.html) is one – Basile Starynkevitch May 26 '17 at 11:41
  • `wchar_t` is a 2-byte character type used in the past for UTF-16, not UTF-8. Nowadays, char16_t is used for UTF-16 and char32_t for UTF-32, with corresponding STL string classes, e.g. u16string. There is no specialized type for UTF-8. `char` is used whenever UTF-8 is required, which *can* lead to problems. – Panagiotis Kanavos May 26 '17 at 11:44
  • You can specify [Unicode string literals](https://msdn.microsoft.com/en-us/library/69ze775t.aspx) by using the appropriate prefix, e.g. `u8"A"` specifies a UTF-8 string, `u"Abc"` a UTF-16 string, `u8" = \U0001F607 is O:-)"` is UTF-8, `u" = \U0001F603 is :-D"` is UTF-16 – Panagiotis Kanavos May 26 '17 at 11:46
  • Finally, it's not that you need to read about UTF-8. C++ just has [bad support for Unicode](https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11) compared to other languages. For example, there are no UTF-16/32 stream types yet, forcing you to convert from UTF-16 to UTF-8 to use cin, cout. You have to use the same types for ANSI-encoded and UTF-8 text, making it very risky to use text from multiple encodings with the same code. You'll have to ensure that you use char, string throughout and convert every input to UTF-8 if there is any chance that the encoding isn't UTF-8. – Panagiotis Kanavos May 26 '17 at 12:17
  • Did you try `wchar_t a=L'\u0628';` (note the `L`)? – YePhIcK May 26 '17 at 14:44
  • @YePhIcK thanks, it works now... but what is the `L` for? – Morteza Rahimzade May 26 '17 at 15:02
  • Let's see... 1) UTF-8 characters use `char`. Since they can span multiple bytes (and thus multiple `char`s), you should probably use a `std::string` or, failing that, a `char*`. 2) C++ only has partial support for Unicode; UTF-8, in particular, is poorly implemented, due to reusing `std::string` (which considers every `char` to be a distinct character). 3) When declaring a character or string literal, the prefix `L` indicates that `wchar_t` should be the character type used for that literal (i.e., `wchar_t` for a character literal, and `const wchar_t[]` for a string literal); see the sketch after this list. – Justin Time - Reinstate Monica May 26 '17 at 16:01
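
To illustrate the literal prefixes mentioned in the comments above, here is a minimal sketch. It assumes a C++11 to C++17 compiler (in C++20 `u8` literals switch to `char8_t`), and the variable names are only illustrative:

#include <string>

int main() {
    // Each prefix selects a different character type:
    const char*     utf8  = u8"\u0628"; // UTF-8 code units in plain char / std::string
    const char16_t* utf16 = u"\u0628";  // UTF-16 code units in char16_t / std::u16string
    const char32_t* utf32 = U"\u0628";  // UTF-32 code points in char32_t / std::u32string
    const wchar_t*  wide  = L"\u0628";  // wchar_t (16-bit, UTF-16, on Windows) / std::wstring

    wchar_t single = L'\u0628';                    // the wide-character literal from the comments
    std::u16string word = u"\u0628\u0627\u0628";   // the word "باب" kept as UTF-16 code units
    return 0;
}

Whichever type is chosen, the string classes count code units, not user-perceived characters, which is exactly the pitfall the comments warn about.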

1 Answer


The solution is the following line:

wchar_t a=L'\u0628';

The `L` prefix tells the compiler that the literal is a wide-character literal, i.e. of type `wchar_t` rather than a plain `char`, so the value doesn't get truncated to 8 bits - thus this works as intended.
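
To connect this to the storage question, here is a minimal sketch (assuming Windows with MSVC, where wchar_t is 16 bits wide; the variable names are made up for illustration) that keeps both a single Farsi character and a whole word in wide types:

#include <io.h>      // _setmode, _fileno
#include <fcntl.h>   // _O_U16TEXT
#include <iostream>
#include <string>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);      // UTF-16 console output, as in the question

    // wchar_t a = '\u0628';                    // the narrow literal from the question:
                                                // the value doesn't fit and the wrong character prints
    wchar_t single = L'\u0628';                 // wide-character literal, kept intact
    std::wstring word = L"\u0628\u0627\u0628";  // a whole Farsi word ("باب") in a wstring

    std::wcout << single << L'\n' << word << std::endl;
    return 0;
}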

UPDATE

If you are building/running this as a console application on Windows you need to manage your code pages accordingly. The following code worked for me with Cyrillic input (Windows code page 1251) once I set the proper code page before the wcin and wcout calls, basically at the very top of my main():

SetConsoleOutputCP(1251);
SetConsoleCP(1251);

For Farsi I'd expect you to use code page 1256.


Full test code for your reference:

#include <iostream>
#include <Windows.h>

using namespace std;

int main()
{
    SetConsoleOutputCP(1256); // to manage console output
    SetConsoleCP(1256);       // to properly process console input

    wchar_t b;
    wcin >> b;
    wcout << b << endl;
}
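
An alternative sketch, assuming the Microsoft CRT's UTF-16 console mode (the same _setmode call the question already uses for output), reads a whole Farsi string into a std::wstring instead of a single wchar_t:

#include <io.h>      // _setmode, _fileno
#include <fcntl.h>   // _O_U16TEXT
#include <iostream>
#include <string>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT); // UTF-16 console output
    _setmode(_fileno(stdin),  _O_U16TEXT); // UTF-16 console input

    std::wstring line;
    std::getline(std::wcin, line);         // read an entire line of Farsi text
    std::wcout << line << std::endl;       // echo it back
    return 0;
}

With this approach no SetConsoleCP call is needed, but it is Windows-specific and the console font still has to contain the glyphs.
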
YePhIcK
  • I wrote this code too, but it still doesn't work... it shows me the wrong character: `wchar_t b=L''; wcin>>b; wcout<` – Morteza Rahimzade May 26 '17 at 19:02
  • Is that a console application? It is possible that your codepage is not set up properly. On which OS are you running this? – YePhIcK May 26 '17 at 22:01
  • Thanks for your help, my friend (@YePhIcK). When I use code page 1256 it shows the correct character, but I still have another problem... in Farsi we have, for example, 4 glyphs for one character (character: kāf | ک | کـ | ـکـ | ـک), each one with its own code point, but I can't show all of them... before, I used this code: `_setmode(_fileno(stdout), _O_U16TEXT); wcout<` – Morteza Rahimzade Jun 01 '17 at 14:49