
My program gets a Chinese input string in UTF-32 encoding:

./myprogram 我想玩 

I want to convert this to UTF-8. For this I am using the UTF8-CPP library: http://utfcpp.sourceforge.net

#include <cstdio>
#include <iterator>
#include <string>
#include <vector>
#include "source/utf8.h"
using namespace std;
int main(int argc, char** argv)
{
    printf("argv[1] = %s \n", argv[1]);
    string str = argv[1];
    printf("str = %s \n", str);

    vector<unsigned char> utf8result;
    utf8::utf32to8(str.begin(), str.end(), back_inserter(utf8result));
}

I got the following output in the terminal:

argv[1] = 系 
str =  D�k� 
terminate called after throwing an instance of 'utf8::invalid_code_point'
  what():  Invalid code point

How can I fix this code so that the utf32to8 conversion succeeds? What am I doing wrong? Please explain. After that, I want to write the resulting UTF-8 to a file.

Vcvv
  • Your string is not UTF-32 to start with. Your first task is to understand what encoding you are starting with. Try printing argv[1] as a sequence of byte values. If it's still not clear, post those byte values here. – john Jan 27 '18 at 09:12
  • Hey, thanks for your answer. argv[1] is probably already in UTF-8 encoding, so I don't need to convert it. But my task is to convert UTF-32 to UTF-8, and the UTF-32 needs to be passed to the program via a command-line argument, as I showed in the description. Could you help me with what I should do? – Vcvv Jan 27 '18 at 09:39

2 Answers


The command line on most Linux distributions passes arguments in as UTF-8, so you need to convert the argument to UTF-32 when you receive it and then convert it back to UTF-8 when you print it out.

Or you could create a UTF-32 string in the program, e.g. std::u32string u32s = U"我想玩";

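#include <iostream>
#include <iterator>
#include <string>

#include "source/utf8.h"

int main()
{
    // UTF-32 literal: one char32_t per code point
    std::u32string u32s = U"我想玩";

    // Encode each code point as UTF-8 and append the bytes to u8s
    std::string u8s;
    utf8::utf32to8(u32s.begin(), u32s.end(), std::back_inserter(u8s));

    std::cout << u8s << '\n';
}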

Note:

From C++11 onwards you don't need to use third-party UTF libraries: the Standard Library has its own conversion facilities, although they are not easy to use.

You can write nicer functions to wrap them like in this answer here:

Any good solutions for C++ string code point and code unit?

Galik
  • Not sure how good an idea it is to use the standard library for that, because the "Locale-independent unicode conversion facets" are already deprecated in C++17. – Andrei Damian Jan 27 '18 at 10:58
  • @AndreiAndrey The current functions won't be removed until well after the replacement is in place. Also, by wrapping the standard functions in user-friendly functions the way I have in the linked answer, you are completely protected. You simply replace the code in the wrapper functions. Any software you write only needs a slight adjustment in one place. So I don't see a problem. – Galik Jan 27 '18 at 12:05
  • @AndreiAndrey I feel it is much better tracking what happens in the Standard Library than the potential changes in a *third party library*. The Standards committee are trying harder than anyone to maintain backward compatibility. – Galik Jan 27 '18 at 12:08
  • I agree with what you are saying. I just wanted to point out it's far from the ideal solution. – Andrei Damian Jan 27 '18 at 12:38
  • @AndreiAndrey It is frustrating that they finally put some Unicode support in and then changed it right after. From what I have seen of the changes, though, they don't look so large. IIRC it is more like replacing what's there with something similar but less hard-coded. – Galik Jan 27 '18 at 12:44

Most likely argv[1] is already UTF-8 encoded, because that is the default way to handle Unicode on Linux. Note that UTF-32 text cannot be properly represented by std::string or by a C-style null-terminated array of char, because every UTF-32 code unit occupies 4 bytes.

user7860670