2

I am trying to help a friend with a project that was supposed to take 1 hour and has now taken 3 days. Needless to say I feel very frustrated and angry ;-) ooooouuuu... I breathe.

So the program, written in C++, just reads a bunch of files and processes them. The problem is that my program reads files which use a UTF-16 encoding (because the files contain words written in different languages) and a simple use of ifstream just doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.

Now I have spent literally the whole afternoon on the web trying to find info about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to! It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I am specialised in graphics apps, not desktop apps). I just can't get it.

This is what I have done so far (but it doesn't work):

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

That's the best I could come up with, but it doesn't even work, and it doesn't do anything better. The real problem is that I don't understand what I am doing in the first place anyway.

SO PLEASE PLEASE HELP! It is really driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++) and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a fairly recent version of gcc, 4.6.3 I believe). So I am not even sure using codecvt and locale is a good idea (as in "possible"). Would there be a better (or another) option?

If I convert all the files to UTF-8 just from the command line (using a Linux command), am I going to potentially lose information?

Thanks a lot, I will be forever grateful to you if you help me on this.

user18490
  • You will not lose any information converting UTF-16 to UTF-8. I think your mistake is in thinking that C++ will do this for you. I'm not completely sure of this, but I don't believe it will. In any case I would just hand code a UTF-16 to UTF-8 conversion. It's straightforward, it would certainly take you less than three days. – john Sep 15 '13 at 15:42
  • Well the problem is that rather than reading about UTF-16 I have been stupidly trying to brute-force a solution by copy/pasting some code from the net that I wasn't fully understanding... ;-( So are you SURE converting from 16 to 8 will not result in a loss of information? The question then is why use UTF-16 for foreign languages in the first place. I was assuming it was necessary because some alphabets have more chars than you can encode with UTF-8? – user18490 Sep 15 '13 at 15:46
  • Both UTF-16 and UTF-8 are complete encodings of Unicode. I'm sure you will not lose any information. – john Sep 15 '13 at 15:48
  • UTF-16 is likely used because the files come from a Java/DotNET background. Nobody on Unix would think of using UTF-16 for anything. (UTF-8 can in fact represent *more* characters than UTF-16.) – user4815162342 Sep 15 '13 at 15:49
  • @user4815162342 That's true, although none of those extra characters are in the Unicode character set. – john Sep 15 '13 at 15:49
  • @john They do come in handy when writing Klingon, though :) – user4815162342 Sep 15 '13 at 15:50
  • gcc doesn't support the C++11 Unicode conversions yet; if you don't want to write them by hand, you will need a library such as Boost.Locale to be portable. – Cubbi Sep 15 '13 at 18:24

3 Answers

3

If I convert all the files to UTF-8 just from the command line (using a Linux command), am I going to potentially lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.
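For instance, here is a minimal sketch of that conversion using the C++11 <codecvt> facilities, assuming a compiler that ships them (as Cubbi notes in the comments, gcc did not at the time). The file names are just placeholders:

#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

int main()
{
    // Read the raw bytes of the UTF-16 file.
    std::ifstream in("input-utf16.txt", std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Decode the UTF-16 bytes into UTF-32 code points. consume_header eats
    // the BOM and uses it to determine the byte order.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::consume_header>,
        char32_t> utf16;
    std::u32string codepoints = utf16.from_bytes(bytes);

    // Re-encode the code points as UTF-8 and write them out.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf8;
    std::ofstream out("output-utf8.txt", std::ios::binary);
    out << utf8.to_bytes(codepoints);
}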


When wide characters were introduced they were intended to be a text representation used exclusively internal to a program, and never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out to narrow characters in the output file, and converting narrow characters in a file to wide characters in memory when reading.

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.

Of course the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.


However, since some people started writing files out in UTF-16, other people have just had to deal with it. The way they do that with C++ streams is by creating codecvt facets that will treat char as holding half a UTF-16 code unit, which is what codecvt_utf16 does.
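As a tiny illustration of that (a sketch of my own, using std::wstring_convert rather than an imbued stream), here is codecvt_utf16 decoding a hard-coded big-endian UTF-16 byte sequence, including a surrogate pair, into UTF-32 code points:

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    // Six bytes of big-endian UTF-16: 'A' (U+0041) followed by the surrogate
    // pair D83D DE00, which encodes U+1F600. Each char holds half a UTF-16
    // code unit.
    std::string bytes("\x00\x41\xD8\x3D\xDE\x00", 6);

    // codecvt_utf16 defaults to big-endian input with no BOM handling.
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    std::u32string codepoints = conv.from_bytes(bytes);

    for (char32_t c : codepoints)
        std::printf("U+%04X\n", static_cast<unsigned int>(c)); // U+0041, U+1F600
}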

So with that explanation, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here's one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
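If <codecvt> isn't available at all (the gcc 4.6.3 from the question doesn't ship it, which is why Cubbi's comment suggests Boost.Locale), a rough sketch of the same conversion with boost::locale::conv::utf_to_utf follows. The file name, the presence of a BOM, and the little-endian byte order are all assumptions here:

#include <boost/locale/encoding_utf.hpp>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    // Read the raw UTF-16 bytes (binary mode, as above).
    std::ifstream in("input-utf16.txt", std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Reassemble the bytes into 16-bit code units. This assumes the file's
    // byte order matches the host's (e.g. a Windows-generated UTF-16LE file
    // read on little-endian Linux); a robust version would check the BOM
    // and byte-swap if necessary.
    std::vector<std::uint16_t> units(bytes.size() / 2);
    if (!units.empty())
        std::memcpy(&units[0], bytes.data(), units.size() * 2);

    // Drop a leading BOM if there is one.
    if (!units.empty() && units[0] == 0xFEFF)
        units.erase(units.begin());

    // UTF-16 -> UTF-8 with Boost.Locale.
    std::string utf8 = boost::locale::conv::utf_to_utf<char>(
        units.data(), units.data() + units.size());

    std::cout << utf8 << '\n';
}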
bames53
  • That's a fantastically helpful answer. Well explained, long, complete with code. Thank you so much. It shows me that I know strictly nothing of how this particular part of C++ works and what it does. As much as I find it "geeky" and advanced, it's still very useful to know it's there, but it feels like I will need to take this time to study this, learn it, and digest it. Thanks again. It's VERY appreciated. – user18490 Sep 15 '13 at 22:55
  • I find Unicode and encodings inordinately interesting, which I guess is good because it's hard to know how to deal with them in C++ without understanding the minutia. Unless you're actually doing more serious text processing the easiest thing is just to use UTF-8 everywhere. – bames53 Sep 15 '13 at 23:02
  • I don't know if you'll find it helpful, but here's an explanation on why wchar_t isn't as useful as people hoped: http://stackoverflow.com/a/11107667/365496 – bames53 Sep 15 '13 at 23:07
0

I adapted, corrected and tested Mats Petersson's impressive solution.

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF); // | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}



#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
     const UTF16 * const end = input + 1 * input_size;
     while (input < end){
       const UTF16 uc = *input++;
       std::vector<int> vec; // endianess
       vec.push_back(U16_LEAD(uc) & 0xFF);
       printf("LEAD + %.4x\n",U16_LEAD(uc) & 0x00FF);
       vec.push_back(U16_TRAIL(uc) & 0xFF);
       printf("TRAIL + %.4x\n",U16_TRAIL(uc) & 0x00FF);
       *output++ = utf16_to_utf32(vec);
     }
}
#ifdef __cplusplus
}
#endif
Frank
  • Your "fix" is clearly not right - I'm not saying my code is correct, but your fix is clearly not right, since encoding 10 bits in 16 bits and then just discarding the other 10 bits would be completely meaningless. – Mats Petersson Apr 02 '16 at 08:46
  • @Mats Petersson, I just preserved 16 bits in UTF16 as you recommended and it works fine. How should I correctly convert or marshal an array of C++ struct CC_STR32 { wchar_t szString[32] ;} to either a C# IntPtr or StringBuilder on Ubuntu Linux 15.10 and Mono version 4.2.1? Thank you. – Frank Apr 03 '16 at 15:25
  • @Mats Petersson, Thank you for your comment. I meant to ask how I should correctly convert or marshal an array of C++ struct CC_STR32 { wchar_t szString[32] ;} to a C# array of IntPtrs on Ubuntu Linux 15.10 and Mono version 4.2.1? – Frank Apr 05 '16 at 00:17
-1

UTF-8 is capable of representing all valid Unicode characters (code-points), which is better than UTF-16 (which covers the first 1.1 million code-points). [Although, as the comment explains, there are no valid Unicode code-points beyond the 1.1 million value, so UTF-16 is "safe" for all currently available code-points - and probably for a long time to come, unless we get extraterrestrial visitors with a very complex writing language...]

It does this by, when necessary, using multiple bytes/words to store a single code-point (what we'd call a character). In UTF-8, this is marked by the highest bit being set - in the first byte of a "multibyte" character, the top two bits are set, and in the following byte(s) the top bit is set, and the next from the top is zero.

To convert an arbitrary code-point to UTF-8, you can use the code in a previous answer from me. (Yes, that question talks about the reverse of what you are asking for, but the code in my answer covers both directions of conversion)
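For reference, here is a minimal sketch of such a code-point-to-UTF-8 encoder (not the code from that linked answer, just an illustration of the bit layout described above; it does no validation of the input code point):

#include <string>

// Encode a single code point as a UTF-8 byte sequence.
std::string utf32_to_utf8(unsigned int cp)
{
    std::string out;
    if (cp < 0x80) {                      // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}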

Converting from UTF16 to "integer" will be a similar method, except for the length of the input. If you are lucky, you can perhaps even get away with not doing that...

UTF16 uses the range D800-DBFF as a first part, which holds 10 bits of data, and then the following item is DC00-DFFF, holding the following 10 bits of data.

Code for the conversions between 16-bit and 32-bit (I have only tested this a little bit, but it appears to work OK):

std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000)
    {
        if ((charcode & 0xFC00) == 0xD800)
        {
            std::cerr << "Error bad character code" << std::endl;
            exit(1);
        }
        r.push_back(charcode);
        return r;
    }
    charcode -= 0x10000;
    if (charcode > 0xFFFFF)
    {
        std::cerr << "Error bad character code" << std::endl;
        exit(1);
    }
    int coded = 0xD800 | ((charcode >> 10) & 0x3FF);
    r.push_back(coded);
    coded = 0xDC00 | (charcode & 0x3FF);
    r.push_back(coded);
    return r;
}


int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}
Mats Petersson
  • Thank you. For now I am using iconv to convert the file using the system call in the program. That seems to work. Not ideal, but I will learn about utf-16 later... – user18490 Sep 15 '13 at 16:29
  • Unicode is limited to what UTF-16 can handle. They've made that decision, as they have no expectation that those million codepoints will run out in the next millennium. – prosfilaes Sep 15 '13 at 21:38
  • Thank you to both Mats and Bames53 for the very interesting answers and the big effort. – user18490 Sep 15 '13 at 22:57
  • Can someone please explain why this answer is downvoted? If there is something wrong, I'd like to know it... – Mats Petersson Sep 15 '13 at 23:44
  • @Mats Petersson, I just read and adapted your very impressive solution. I found a bug in it. May we chat about this at a time of your convenience. – Frank Apr 01 '16 at 03:34
  • [Both UTF-8 and UTF-16 can represent **all** Unicode characters](http://stackoverflow.com/q/2241348/995714) so your first sentence is incorrect. All valid Unicode characters are in the 17 planes and the plane space will never be extended – phuclv Apr 01 '16 at 03:49
  • @Lưu Vĩnh Phúc, When I convert UTF-16,L"MAX_DEFAULT" to UTF-32 using C++ or DLLImport C# and then convert UTF-32 back to UTF-16 using C++ or DLLImport C# , I get random "junk" characters. How do I make sure to avoid the random junk characters and recover the original multibyte string. My favorite restaurant in Boston, MA is Pho's Pasteur. – Frank Apr 01 '16 at 06:13
  • @Frank that's because you've converted it incorrectly. Correct Unicode texts can be encoded in any Unicode encoding like UTF-8, 16 or 32 without loss of information – phuclv Apr 01 '16 at 06:30
  • @Lưu Vĩnh Phúc, My method of conversion from UTF-16 to UTF-32 is shown in my answer above, void convert_utf16_to_utf32(UTF16 *input, size_t input_size, UTF32 *output). My method of conversion from UTF-32 to UTF-16 is shown in Mats Petersson's answer above, std::vector<int> utf32_to_utf16(int charcode). Could you please tell me what I am doing incorrectly? I like rice noodles with Smithfield Ham. Thank you – Frank Apr 01 '16 at 09:32