Conversion from ASCII to Unicode char code (FreeType2)

Question

I'm using FreeType2 in one of my projects. In order to render a letter, I need to provide a Unicode two-byte character code. The char codes a program reads are in ASCII one-byte format though. It poses no problem for char codes below 128 (the character codes are the same), but the other 128 do not match. For instance:

'a' in ASCII is 0x61, 'a' in Unicode is 0x0061 - that's fine
'ą' in ASCII is 0xB9, 'ą' in Unicode is 0x0105 - completely different

I was trying to use WinAPI functions there, but I must be doing something wrong. Here's a sample:

unsigned char szTest1[] = "ąółź"; //ASCII format
wchar_t* wszTest2;
int size = MultiByteToWideChar(CP_UTF8, 0, (char*)szTest1, 4, NULL, 0);
printf("size = %d\n", size);
wszTest2 = new wchar_t[size];
MultiByteToWideChar(CP_UTF8, 0, (char*)szTest1, 4, wszTest2, size);
printf("HEX: %x\n", wszTest2[0]);
delete[] wszTest2;

I'm expecting a new wide string to be created, with no NULL at the end. However, the size variable always equals 0. Any idea what I'm doing wrong? Or maybe there's an easier way to solve the problem?

THERE ARE NO [ASCII CODES](https://en.wikipedia.org/wiki/ASCII) ABOVE 127! BY DEFINITION! Sorry for shouting. Now that I've got your attention, you need to find out what the *actual* encoding of your "ascii text" is and actually *use that encoding to decode it*. — Joachim Sauer, Oct 16 '12 at 14:37

score 6 · Accepted Answer · edited May 23 '17 at 12:31

The CodePage parameter passed to MultiByteToWideChar is wrong. UTF-8 is not the same as ASCII. You should be using CP_ACP, which selects the current system ANSI code page (which is also not the same as ASCII; see "Unicode, UTF, ASCII, ANSI format differences").

The size is zero most likely because your test string is not a valid UTF-8 string.
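To make the "not valid UTF-8" point concrete, here is a minimal, portable sketch that classifies a single byte by its role in a UTF-8 sequence. The names `Utf8ByteKind`, `classifyUtf8Byte`, `canStartUtf8Sequence` and `checkQuestionBytes` are hypothetical helpers made up for this illustration; they are not part of Win32 or any library, and the sketch looks at one byte only rather than validating a whole string.

```cpp
#include <cassert>

// Role a single byte can play inside a UTF-8 sequence.
enum class Utf8ByteKind { Ascii, Continuation, Lead2, Lead3, Lead4, Invalid };

// Hypothetical helper: classifies a byte by its high bits only.
inline Utf8ByteKind classifyUtf8Byte(unsigned char b)
{
    if (b < 0x80) return Utf8ByteKind::Ascii;         // 0xxxxxxx
    if (b < 0xC0) return Utf8ByteKind::Continuation;  // 10xxxxxx
    if (b < 0xC2) return Utf8ByteKind::Invalid;       // 0xC0, 0xC1: overlong forms
    if (b < 0xE0) return Utf8ByteKind::Lead2;         // 110xxxxx
    if (b < 0xF0) return Utf8ByteKind::Lead3;         // 1110xxxx
    if (b < 0xF5) return Utf8ByteKind::Lead4;         // 11110xxx, up to U+10FFFF
    return Utf8ByteKind::Invalid;                     // 0xF5..0xFF never occur
}

// A character can only begin with an ASCII byte or a lead byte.
inline bool canStartUtf8Sequence(unsigned char b)
{
    const Utf8ByteKind kind = classifyUtf8Byte(b);
    return kind != Utf8ByteKind::Continuation && kind != Utf8ByteKind::Invalid;
}

// Self-check using the bytes discussed in the question.
inline void checkQuestionBytes()
{
    // 0xB9 (the question's byte for 'ą') is a continuation byte, so a string
    // that starts with it cannot be decoded as UTF-8.
    assert(!canStartUtf8Sequence(0xB9));
    // In UTF-8, 'ą' (U+0105) is the two-byte sequence 0xC4 0x85.
    assert(canStartUtf8Sequence(0xC4));
    assert(classifyUtf8Byte(0x85) == Utf8ByteKind::Continuation);
}
```

Since 0xB9 has the bit pattern 10111001, a UTF-8 decoder treats it as the middle of a character with no beginning, which is consistent with the conversion failing on the first byte of the test string.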

For almost all Win32 functions you can call GetLastError() after the function fails to get the detailed error code, so calling that would give you more details as well.

answered Oct 16 '12 at 14:40 by shf301

Yes, the test string is most definitely not valid UTF-8: 0xB9 is not a valid leading byte. – R. Martinho Fernandes Oct 16 '12 at 14:44
Yes indeed, there was an error. It was ERROR_NO_UNICODE_TRANSLATION. After changing CP_UTF8 to CP_ACP the _size_ variable increased to 4, as suspected. Moreover, the correct char codes were returned, so that's great! Also thanks for the list of differences between encoding formats. Will come in handy. Cheers – Tomalla Oct 16 '12 at 15:07

Mr.C64 · Answer 2 · 2014-09-18T09:44:40.050

The "pure" ASCII character set is restricted to the range 0-127 (7 bits). The 8-bit characters with the most significant bit set (i.e. those in the range 128-255) are not uniquely defined: their meaning depends on the code page. So your character ą (LATIN SMALL LETTER A WITH OGONEK) is represented by the value 0xB9 in one particular code page, which should be Windows-1250. In other code pages, the value 0xB9 is associated with a different character (for example, in the Windows-1252 code page, 0xB9 is the character ¹, i.e. a superscript digit 1).
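The code-page dependence described above can be sketched without any Windows API. The two functions below are hypothetical, deliberately partial decoders written for this illustration: the Windows-1250 one covers only ASCII plus the four letters from the question, and the Windows-1252 one covers only the ranges where that code page coincides with Latin-1. Both return 0 for bytes they do not cover.

```cpp
#include <cassert>
#include <cstdint>

// Partial Windows-1250 decoder (assumption: only the bytes needed here).
inline std::uint16_t decodeCp1250Partial(unsigned char b)
{
    if (b < 0x80) return b;        // ASCII range matches Unicode
    switch (b) {
        case 0xB9: return 0x0105;  // ą
        case 0xF3: return 0x00F3;  // ó
        case 0xB3: return 0x0142;  // ł
        case 0x9F: return 0x017A;  // ź
        default:   return 0;       // not covered by this sketch
    }
}

// Partial Windows-1252 decoder: ASCII and 0xA0..0xFF map to the code point
// with the same numeric value; 0x80..0x9F is not covered by this sketch.
inline std::uint16_t decodeCp1252Partial(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0;
}

// The same byte yields different characters under different code pages.
inline void checkCodePageDifference()
{
    assert(decodeCp1250Partial(0xB9) == 0x0105);  // ą
    assert(decodeCp1252Partial(0xB9) == 0x00B9);  // ¹ (superscript one)
}
```

A real program would not hand-write such tables; the point is only that the byte value alone does not identify the character until a code page is chosen.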

To convert characters from a particular code page to Unicode UTF-16 using the Win32 API, you can use MultiByteToWideChar, specifying the correct code page. That is not CP_UTF8 as written in the code in your question: CP_UTF8 identifies Unicode UTF-8. You may want to try specifying 1250 (ANSI Central European; Central European (Windows)) as the code page identifier.

If you have access to ATL in your code, you can use the ATL string conversion helper classes like CA2W, which wraps the MultiByteToWideChar() call and the memory allocation in a RAII class; e.g.:

#include <atlconv.h> // ATL String Conversion Helpers
// 'test' is a Unicode UTF-16 string.
// Conversion is done from code-page 1250
// (ANSI Central European; Central European (Windows))
CA2W test("ąółź", 1250);

Now you should be able to use the test string with your Unicode APIs.

If you don't have access to ATL or want a C++ STL-based solution, you may want to consider some code like this:

///////////////////////////////////////////////////////////////////////////////
//
// Modern STL-based C++ wrapper to Win32's MultiByteToWideChar() C API.
//
// (based on http://code.msdn.microsoft.com/windowsdesktop/C-UTF-8-Conversion-Helpers-22c0a664)
//
///////////////////////////////////////////////////////////////////////////////

#include <exception>    // for std::exception
#include <iostream>     // for std::cout
#include <ostream>      // for std::endl
#include <stdexcept>    // for std::runtime_error
#include <string>       // for std::string and std::wstring
#include <Windows.h>    // Win32 Platform SDK

//-----------------------------------------------------------------------------
// Define an exception class for string conversion error.
//-----------------------------------------------------------------------------
class StringConversionException 
    : public std::runtime_error
{
public:
    // Creates exception with error message and error code.
    StringConversionException(const char* message, DWORD error)
        : std::runtime_error(message)
        , m_error(error)
    {}

    // Creates exception with error message and error code.
    StringConversionException(const std::string& message, DWORD error)
        : std::runtime_error(message)
        , m_error(error)
    {}

    // Windows error code.
    DWORD Error() const
    {
        return m_error;
    }

private:
    DWORD m_error;
};

//-----------------------------------------------------------------------------
// Converts an ANSI/MBCS string to Unicode UTF-16.
// Wraps MultiByteToWideChar() using modern C++ and STL.
// Throws a StringConversionException on error.
//-----------------------------------------------------------------------------
std::wstring ConvertToUTF16(const std::string & source, const UINT codePage)
{
    // An empty input converts to an empty output. MultiByteToWideChar()
    // fails when asked to convert zero characters, so handle it up front.
    if (source.empty())
    {
        return std::wstring();
    }

    // Fail if an invalid input character is encountered
    static const DWORD conversionFlags = MB_ERR_INVALID_CHARS;

    // Request size of destination string
    const int utf16Length = ::MultiByteToWideChar(
        codePage,           // code page for the conversion
        conversionFlags,    // flags
        source.c_str(),     // source string
        static_cast<int>(source.length()),  // length (in chars) of source string
        NULL,               // unused - no conversion done in this step
        0                   // request size of destination buffer, in wchar_t's
        );
    if (utf16Length == 0) 
    {
        const DWORD error = ::GetLastError();
        throw StringConversionException(
            "MultiByteToWideChar() failed: Can't get length of destination UTF-16 string.",
            error);
    }

    // Allocate room for destination string
    std::wstring utf16Text;
    utf16Text.resize(utf16Length);

    // Convert to Unicode UTF-16
    if ( ! ::MultiByteToWideChar(
        codePage,           // code page for conversion
        0,                  // validation was done in previous call
        source.c_str(),     // source string
        static_cast<int>(source.length()),     // length (in chars) of source string
        &utf16Text[0],      // destination buffer
        static_cast<int>(utf16Text.length())   // size of destination buffer, in wchar_t's
        )) 
    {
        const DWORD error = ::GetLastError();
        throw StringConversionException(
            "MultiByteToWideChar() failed: Can't convert to UTF-16 string.",
            error);
    }

    return utf16Text;
}

//-----------------------------------------------------------------------------
// Test.
//-----------------------------------------------------------------------------
int main()
{
    // Error codes
    static const int exitOk = 0;
    static const int exitError = 1;

    try 
    {
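        // Test input string:
        //
        // ą - LATIN SMALL LETTER A WITH OGONEK
        //
        // The leading 'x' is a placeholder that is overwritten with 0xB9,
        // the Windows-1250 byte for 'ą', so the test does not depend on
        // the encoding of this source file.
        std::string inText("x - LATIN SMALL LETTER A WITH OGONEK");
        inText[0] = static_cast<char>(0xB9);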

        // ANSI Central European; Central European (Windows) code page
        static const UINT codePage = 1250;

        // Convert to Unicode UTF-16
        const std::wstring utf16Text = ConvertToUTF16(inText, codePage);

        // Verify conversion.
        //  ą - LATIN SMALL LETTER A WITH OGONEK
        //  --> Unicode UTF-16 0x0105
        // http://www.fileformat.info/info/unicode/char/105/index.htm
        if (utf16Text[0] != 0x0105) 
        {
            throw std::runtime_error("Wrong conversion.");
        }
        std::cout << "All right." << std::endl;
    }
    catch (const StringConversionException& e)
    {
        std::cerr << "*** ERROR:\n";
        std::cerr << e.what() << "\n";
        std::cerr << "Error code = " << e.Error();
        std::cerr << std::endl;
        return exitError;
    }
    catch (const std::exception& e)
    {
        std::cerr << "*** ERROR:\n";
        std::cerr << e.what();
        std::cerr << std::endl;
        return exitError;
    }
    return exitOk;
}

///////////////////////////////////////////////////////////////////////////////
