
I know I can convert ASCII to Unicode strings using MultiByteToWideChar, but I want an API-less solution. The only difference is that unicode is 2 bytes compared to ASCII, which is 1.

It should be something like the following, but it doesn't work.

The problem is:

[screenshot of the incorrect result in the debugger]

// HeapAlloc/HeapFree wrappers used instead of the CRT allocator (built with /NODEFAULTLIB).
void* __malloc(size_t size)
{
   return HeapAlloc(GetProcessHeap(), 0, size); 
}

void __free(void* p)
{
   if (p) HeapFree(GetProcessHeap(), 0, p); 
}

wchar_t* ascii_to_unicode(const char* ascii)
{
    if (!ascii)
        return nullptr;

    size_t len;
    wchar_t* unicode;

    len = strlen(ascii) * 2 + 1;
    if (!(unicode = reinterpret_cast<wchar_t*>(__malloc(len))))
        return nullptr;

    for (size_t i = 0; i < len; i++)
        *unicode++ = static_cast<wchar_t>(*ascii++);

    return unicode;
}

char* unicode_to_ascii(const wchar_t* unicode)
{
    if (!unicode)
        return nullptr;

    size_t len;
    char* ascii;

    len = wcslen(unicode) / 2 + 1;
    if (!(ascii = reinterpret_cast<char*>(__malloc(len))))
        return nullptr;

    for (size_t i = 0; i < len; i++)
        *ascii++ = static_cast<char>(*unicode++);

    return ascii;
}

I want to convert the ASCII string returned by strdup and pass the result to my custom get_module_handle function.

char* forwardLib = strdup(address);
char* forwardName = _strchr(forwardLib, '.');
*forwardName++ = 0;

get_module_handle(ascii_to_unicode(forwardLib));

//
void* get_module_handle(const wchar_t* moduleName)
{
    // Locate the PEB through the TEB (fs:[0x30] on x86, gs:[0x60] on x64).
#if defined _M_IX86
    PPEB pPEB = reinterpret_cast<PPEB>(__readfsdword(0x30));
#elif defined _M_X64
    PPEB pPEB = reinterpret_cast<PPEB>(__readgsqword(0x60));
#endif

    // Walk the loader's in-memory-order module list and compare base DLL names.
    for (PLIST_ENTRY pListEntry = pPEB->Ldr->InMemoryOrderModuleList.Flink; pListEntry && pListEntry != &pPEB->Ldr->InMemoryOrderModuleList; pListEntry = pListEntry->Flink)
    {
        PLDR_DATA_TABLE_ENTRY pLdrDataTableEntry = CONTAINING_RECORD(pListEntry, LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);

        if (!__wcsicmp(pLdrDataTableEntry->BaseDllName.Buffer, moduleName))
            return pLdrDataTableEntry->DllBase;
    }

    return nullptr;
}
  • Your code is invalid C. – pmg May 22 '20 at 08:01
  • @pmg, added C++ tag, because I like `nullptr` and the casts. – nop May 22 '20 at 08:05
  • *APIless solution* -- Does that include cross-platform solutions? You should consider whether you want to get into the weeds of properly doing these conversions yourself. Also, why pointers, and not convert to / from `std::string` and `std::wstring`? Given what you're trying to accomplish, your approach is a simple 1 or 2 line function using `std::copy` or maybe `std::transform`. – PaulMcKenzie May 22 '20 at 08:10
  • @PaulMcKenzie, I don't want to use CRT. I'm using `/NODEFAULTLIB`. – nop May 22 '20 at 08:12
  • Then why are you using `strlen` and other runtime functions? Second, `std::copy` and `std::transform` are template functions -- they are implemented totally in the header files. Those functions work with character buffers also. – PaulMcKenzie May 22 '20 at 08:14
  • @PaulMcKenzie, I can't use std in a shellcode. – nop May 22 '20 at 08:39
  • @nop Have you tried to use the algorithms, or just making an assumption about it can't be used? The code you posted is frankly very confusing with standard library calls such as `strlen`, non-standard calls such as `__malloc`, and hand-coded loops when library calls could be used. Second, you never mentioned in your question as to what "doesn't work". – PaulMcKenzie May 22 '20 at 08:45
  • What unicode standard do you wanna convert to? If UTF-8, then no conversion is needed as all ascii values are valid in UTF-8. – Fredrik May 22 '20 at 08:51
  • @PaulMcKenzie, `__malloc` is a wrapper for malloc, but you could use the normal one. I posted it anyway, but it's not the cause here. As I told you, it's a shellcode. I can't use CRT, I can't use anything built-in because it causes relocations. – nop May 22 '20 at 09:03
  • @Fredrik, added the execution code. Basically, I want to pass it to `get_module_handle`, while the ASCII is returned by strdup. – nop May 22 '20 at 09:05
  • @nop `len = strlen(ascii) * 2 + 1;` -- The string should be terminated with two null bytes, not just one null byte. The second thing is you should be looking at the memory window of the debugger, not what the debugger believes the string should be -- that would give you a much better picture of the actual byte values being stored. – PaulMcKenzie May 22 '20 at 09:17
  • @nop: what do you mean by *it is a shell code*? If you are seeking help to write malware, you are off topic here. – chqrlie May 22 '20 at 10:30
  • @chqrlie, it's not a malware. It's for a game I'm testing on. And by shellcode, I mean it has no dependencies, so it can be injected into the game as a DLL. I don't think gamecheating is illegal, at least not in my country and I'm just testing, not so much into the games. – nop May 22 '20 at 10:40

2 Answers


The only difference is that unicode is 2 bytes compared to ASCII, which is 1.

"Unicode" is a standard. What Windows calls "Unicode" in an effort to simplify things is actually the UTF-16LE character encoding specified by the Unicode standard. There are other character encodings such as UTF-32BE, UTF-32LE, UTF-8, and UTF-16BE. For UTF-16 and UTF-32, if there is no endianness specified, a "byte order mark" is typically used to distinguish whether data is big endian (BE; U+005A -> 00 5A) or little endian (LE; U+005A -> 5A 00).

For UTF-8, byte order doesn't matter, but some programs such as Windows Notepad add a byte order mark anyway. Many programs do not like this, and saving UTF-8 XML data in Notepad results in malformed XML because no bytes can appear before the XML prolog. For more information about Unicode, I highly suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Addressing your actual question, all ASCII characters (hex codes 0x00..0x7F) are the same in UTF-16LE, except there is a '\0' byte after the ASCII char:

C       ASCII code  UTF-16LE bytes
'z'     7A          7A 00
'K'     4B          4B 00
'\0'    00          00 00
'\n'    0A          0A 00
'\x7F'  7F          7F 00

Any byte outside the range 0x00..0x7F is not ASCII and requires you to know what the byte represents and the corresponding Unicode code point. For example, here is the same byte 0xB9 as interpreted in various Windows code pages:

Code page  Char  Unicode code point
932        ケ     U+FF79
1251       №     U+2116
1252       ¹     U+00B9

This is particularly problematic for code pages like 932 where sometimes multiple bytes are required to express a character. Because of this issue, if you're going to avoid MultiByteToWideChar, then your program needs to reject anything that isn't ASCII. Otherwise, you must use MultiByteToWideChar.
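To illustrate, here is a minimal sketch of such an ASCII-only widening routine. It reuses the __malloc/__free wrappers from the question (so it stays CRT-free), allocates (length + 1) * sizeof(wchar_t) bytes, appends a single wide terminator, and returns nullptr when it meets a byte outside 0x00..0x7F. Treat it as an outline under those assumptions, not a drop-in implementation:

// Sketch only: widen an ASCII string to UTF-16LE without the CRT.
// Reuses the __malloc/__free HeapAlloc wrappers shown in the question.
wchar_t* ascii_to_utf16(const char* ascii)
{
    if (!ascii)
        return nullptr;

    // Count characters by hand (no strlen).
    size_t len = 0;
    while (ascii[len])
        ++len;

    // One wchar_t (2 bytes on Windows) per character, plus the wide terminator.
    wchar_t* wide = reinterpret_cast<wchar_t*>(__malloc((len + 1) * sizeof(wchar_t)));
    if (!wide)
        return nullptr;

    for (size_t i = 0; i < len; ++i)
    {
        unsigned char c = static_cast<unsigned char>(ascii[i]);
        if (c > 0x7F)               // not ASCII: refuse rather than guess a code page
        {
            __free(wide);
            return nullptr;
        }
        wide[i] = static_cast<wchar_t>(c);  // 0x00..0x7F map directly to U+0000..U+007F
    }
    wide[len] = L'\0';

    return wide;                    // caller releases it with __free
}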

MemReflect

You should simplify your approach to the max:

  • don't use malloc or any system-specific equivalent of it.
  • try to avoid converting the strings: work directly on char arrays and UTF-16LE arrays, and write your own inline functions to do this.
  • use automatic arrays if you really need to convert (a minimal sketch follows below).

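A minimal sketch along those lines, assuming the caller supplies an automatic (stack) array and that the input is plain ASCII; the 64-character buffer size in the usage note is just an illustrative choice:

// Sketch only: widen an ASCII name into a caller-provided buffer,
// no heap allocation, no CRT. Returns false if the buffer is too small.
inline bool ascii_to_utf16_buf(const char* ascii, wchar_t* out, size_t out_chars)
{
    if (!ascii || !out || out_chars == 0)
        return false;

    size_t i = 0;
    for (; ascii[i]; ++i)
    {
        if (i + 1 >= out_chars)     // keep room for the terminator
            return false;
        out[i] = static_cast<wchar_t>(static_cast<unsigned char>(ascii[i]));
    }
    out[i] = L'\0';
    return true;
}

// Usage with an automatic array, no malloc involved:
// wchar_t wideName[64];
// if (ascii_to_utf16_buf(forwardLib, wideName, 64))
//     get_module_handle(wideName);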
Dissecting software to discover flaws is enticing and useful if your goal is to fix them. Exploiting software flaws to cheat on games or beat software protection schemes is a form of malware, especially if you share your achievements. Use your skills to produce valuable software instead.

chqrlie