
I am working with Unicode in C++11 and am currently unable to convert a std::string to a std::u32string.

My code is as follows:

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::string str="hello☺😆";

    std::u32string s(str.begin(),str.end());

    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;

    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

    std::cout << "Individual characters of the string are:" << std::endl;
    for(int i=0; i < ustr.countChar32(); i++)
      std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;

    return 0;
}

On executing, the output is (not what I expected):

Unicode string is: hello�������
Size of unicode string = 12
Individual characters of the string are:
h
e
l
l
o
�
�
�
�
�
�
�

Please suggest if there is an ICU library function for this.

dashthird
  • Is there a point in using UTF-32 ? – Michael Chourdakis Feb 08 '20 at 13:50
  • Since there's a `fromUTF32` function, there should be a `toUTF32` there as well, somewhere. This is what you will need to use to convert a `std::string` to a `std::u32string`. Copying each character of a `std::string` into each Unicode value in a `std::u32string` is not going to accomplish anything useful. – Sam Varshavchik Feb 08 '20 at 13:57
  • You can probably adapt the `widen` function in the following post to do what you want: https://stackoverflow.com/questions/51210723/how-to-detect-â€-combination-of-unicode-in-c-string/51212415#51212415 – Paul Sanders Feb 08 '20 at 14:59
  • ICU uses UTF-16 representation. `str` in your example is not UTF-32 encoded. Why again do you want UTF-32 in either direction? Most likely, `str` is in UTF-8, and you want `UnicodeString::fromUTF8` – Igor Tandetnik Feb 08 '20 at 15:58 (a sketch combining this with `toUTF32` follows these comments)
  • @MichaelChourdakis I am trying to use UTF-32 so that any of the possible Unicode characters can be processed – dashthird Feb 08 '20 at 20:30
  • @dashthird you really need to? – Michael Chourdakis Feb 08 '20 at 20:33
  • @MichaelChourdakis Is my approach correct/efficient? – dashthird Feb 08 '20 at 20:41
  • 1
    @dashthird nobody uses UTF-32 today. If in OS, use UTF-16. If in web, use UTF-8. It's extremely unlikely that you 'll encounter some character beyong the BMP so that UTF-16 wouldn't be sufficient. – Michael Chourdakis Feb 08 '20 at 20:51
  • @MichaelChourdakis There are perfectly valid reasons to use UTF-32. Just because it's rarely used as interchange encoding doesn't mean that it cannot be very useful as internal representation for doing string manipulations. – jlh Dec 31 '20 at 14:30
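
Putting the last two suggestions together, here is a minimal sketch of the requested conversion, assuming `str` really holds UTF-8 (the `utf8_to_u32` name is illustrative, and the `reinterpret_cast` between `char32_t*` and `UChar32*` mirrors the one already used in the question):

#include <string>
#include "unicode/unistr.h"

// UTF-8 std::string -> std::u32string, going through ICU's UTF-16 UnicodeString
std::u32string utf8_to_u32(const std::string &utf8)
{
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(utf8);   // UTF-8 -> UTF-16
    std::u32string out(static_cast<std::size_t>(ustr.countChar32()), U'\0');
    UErrorCode status = U_ZERO_ERROR;
    ustr.toUTF32(reinterpret_cast<UChar32 *>(&out[0]),
                 static_cast<int32_t>(out.size()), status);          // UTF-16 -> UTF-32
    return out;
}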

2 Answers


The output makes sense. Presumably you thought you were defining a string with 7 characters? Take a look at str.size(). You defined a string with 12 characters!

Even though you were able to type "hello☺😆" into your program, this string literal does not consist of just seven bytes. The last two characters fall outside the ASCII range, so UTF-8 encodes each of them as a multi-byte sequence. The result is a 12-byte string literal, which initializes a 12-character string, which in turn initializes a 12-character u32string. You've mangled the characters you intended to represent.

Example: The character '☺' is represented in UTF-8 as the three bytes 0xE2 0x98 0xBA. If char is signed on your system (likely), these three bytes take on the values -30, -104, and -70. The conversion to char32_t promotes each of these values to 32 bits and then converts signed to unsigned, resulting in the three values 4294967266, 4294967192, and 4294967226. What you presumably wanted was for these three bytes to be decoded into the single char32_t value 0x263A, the code point of '☺'. However, your conversion does not provide a mechanism for (re-)combining bytes.

Similarly, the character '😆' is represented by the four bytes 0xF0 0x9F 0x98 0x86. These were converted into four 32-bit values instead of being decoded into the single code point 0x1F606.
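
For illustration only (this program is not from the question), the per-byte widening can be seen directly; the exact values printed assume a signed 8-bit char and a 32-bit char32_t:

#include <iostream>
#include <string>

int main()
{
    std::string str = "\xE2\x98\xBA";            // the three UTF-8 bytes of '☺'
    std::u32string s(str.begin(), str.end());    // widens each byte separately
    std::cout << str.size() << '\n';             // prints 3, not 1
    for (char32_t c : s)
        std::cout << static_cast<unsigned long>(c) << '\n';
    // prints 4294967266, 4294967192, 4294967226 -- nothing here
    // reassembles the three bytes into the code point 0x263A
}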

To get the result you want, you need to tell the compiler to interpret your string literal as 7 characters. Try the following initialization of s:

std::u32string s = U"hello☺😆";

The U prefix on the string literal tells the compiler the literal is encoded as UTF-32, so each element is a full char32_t code point. This results in the desired 7-character string (assuming your compiler and editor agree on character encodings, which I think is reasonably likely).
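
With that one change, the rest of the question's code should behave as intended; roughly (keeping the same reinterpret_cast between char32_t* and UChar32* that the question already uses):

std::u32string s = U"hello☺😆";   // 7 code points
icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(
    reinterpret_cast<const UChar32 *>(s.c_str()),
    static_cast<int32_t>(s.size()));
// ustr.countChar32() is now 7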


Gratis debugging takeaway: When your output is not what you expect, check the data at each stage to make sure your input is what you expect.

JaMiT
  • You said "The U prefix on the string literal tells the compiler that each character represents a UTF-32 character." What if I need to input the string? If I replace `std::string str="hello☺"; std::u32string s(str.begin(),str.end());` with `std::u32string s; std::cin >> s;` I get the error `error: cannot bind 'std::istream {aka std::basic_istream}' lvalue to 'std::basic_istream&&' std::cin >> s;` – dashthird Feb 08 '20 at 20:42
  • 1
    @dashthird _"What if I need to input the string?"_ -- maybe that should have been mentioned in the question? Also relevant would be whether the input comes from a stream or from an API, as the best answer is along the lines of using the `U` prefix: start as close to the desired format as possible. For console input, either use `std::wcin` and ICU's `wchar_t` support or `std::cin` and [u32string conversion from string](https://stackoverflow.com/questions/31302506/stdu32string-conversion-to-from-stdstring-and-stdu16string). – JaMiT Feb 08 '20 at 22:10
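
For reference, on toolchains where the <codecvt> header is available (C++11, deprecated since C++17), the string-to-u32string conversion from that linked question looks roughly like this (the function name is illustrative; as the answer below notes, the header was not available to the asker):

#include <codecvt>
#include <locale>
#include <string>

std::u32string utf8_to_u32_codecvt(const std::string &utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);   // throws std::range_error on invalid UTF-8
}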

Thanks everybody for the help!

Using these two links, I was able to find some relevant functions.

I tried using codecvt functions, but I got the error:

fatal error: codecvt: No such file or directory
 #include <codecvt>
                   ^
compilation terminated.

So I skipped that, and on further searching I found the mbrtoc32() function, which works. :)

This is the working code:

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"
#include <cassert>
#include <cwchar>
#include <uchar.h>

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::string str;
    std::cin >> str;
    // For example, the input string is "hello☺😆"

    std::mbstate_t state{}; // zero-initialized to initial state
    char32_t c32;
    const char *ptr = str.c_str(), *end = str.c_str() + str.size() + 1; // include the terminating '\0' so mbrtoc32 eventually returns 0

    icu::UnicodeString ustr;

    while(std::size_t rc = mbrtoc32(&c32, ptr, end - ptr, &state))
    {
      assert(rc != (std::size_t)-3); // no multibyte character ever yields a second char32_t
      if(rc == (std::size_t)-1) break; // invalid multibyte sequence
      if(rc == (std::size_t)-2) break; // incomplete multibyte sequence
      ustr += icu::UnicodeString((UChar32)c32); // append the decoded code point
      ptr += rc; // advance past the bytes just consumed
    }

    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;
    for(int i=0; i < ustr.countChar32(); i++)
      std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;

    return 0;
}

The output on entering the input hello☺😆 is as expected:

Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆

dashthird