
My attempts seem hacky and overly convoluted. Is there a simple way to convert ASCII to UTF16 on Windows and macOS?

(Note that I can't change `prUTF16Char`.)

Attempt (adapted from https://stackoverflow.com/a/54376330)

Prelude

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#if defined(__APPLE__) && defined(__MACH__)
#include <xcselect.h>
#include <wchar.h>
#include <CoreFoundation/CoreFoundation.h>
typedef unsigned short int prUTF16Char;
#else
typedef wchar_t prUTF16Char;
#endif

#define WIDEN2(x) L ## x
#define WIDEN(x) WIDEN2(x)
#define PROJECT_NAME "foo"

Functions

void copy2ConvertStringLiteralIntoUTF16(const wchar_t* inputString, prUTF16Char* destination) {
    size_t length = wcslen(inputString);
#if (defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)) && defined(PLUGIN_MODE)
    wcscpy_s(destination, length + 1, inputString);
#elif defined(__APPLE__) && defined(__MACH__)
    CFRange range = CFRangeMake(0, length);
    CFStringRef inputStringCFSR = CFStringCreateWithBytes(
        kCFAllocatorDefault, reinterpret_cast<const UInt8 *>(inputString),
        length * sizeof(wchar_t), kCFStringEncodingUTF32LE, false);
    CFStringGetBytes(inputStringCFSR, range, kCFStringEncodingUTF16, 0, false,
                     reinterpret_cast<UInt8 *>(destination), length * sizeof(prUTF16Char), NULL);
    destination[length] = 0; // Set NULL-terminator
    CFRelease(inputStringCFSR);
#endif
}

const prUTF16Char * to_wchar(const char* message) {
    const size_t cSize = strlen(message);
    wchar_t *w_str = new wchar_t[cSize];
#if defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    size_t outSize;
    mbstowcs_s(&outSize, w_str, cSize, message, cSize-1);
    return w_str;
#else
    mbstowcs(w_str, message, cSize);
#endif
#if defined(__APPLE__) && defined(__MACH__)
    prUTF16Char *ut16str = new prUTF16Char[cSize];
    copy2ConvertStringLiteralIntoUTF16(w_str, ut16str);
    return ut16str;
#else
    return w_str;
#endif
}

Then I can just define a global var:

static const prUTF16Char* PROJECT_NAME_W =
#if defined(__APPLE__) && defined(__MACH__)
    to_wchar
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    WIDEN
#endif
        (PROJECT_NAME);

And the body of a generic print function taking message:

#if WCHAR_UTF16
    wprintf(L"%s",
#else
    printf("%ls\n",
#endif
            message);

Full attempt:

https://github.com/SamuelMarks/premiere-pro-cmake-plugin/blob/f0d2278/src/common/logger.cpp [rewriting from C++ to C]

Error:

error: initializer element is not a compile-time constant

(In C, unlike C++, objects with static storage duration must be initialised with constant expressions, so the `to_wchar(PROJECT_NAME)` call is not allowed in the initialiser.)


EDIT: Super hacky, but with @barmak-shemirani's solution I can:

#if defined(__APPLE__) && defined(__MACH__)
extern
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
static
#endif
const prUTF16Char* PROJECT_NAME_W
#if defined(__APPLE__) && defined(__MACH__)
    ;
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    WIDEN(PROJECT_NAME);
#endif

…and only initialise and free on the extern variant.
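For completeness, here's roughly how I initialise and free the extern variant (the `logger_init`/`logger_free` names are my own, not part of Adobe's API, and this assumes the C rewrite of `to_wchar` allocates with `malloc`):

/* logger.c: sketch of the macOS (extern) variant's lifecycle */
const prUTF16Char* PROJECT_NAME_W = NULL;

void logger_init(void) {
    PROJECT_NAME_W = to_wchar(PROJECT_NAME); /* heap-allocated */
}

void logger_free(void) {
    free((void*)PROJECT_NAME_W); /* cast away const for free */
    PROJECT_NAME_W = NULL;
}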

Samuel Marks
  • This is tagged as C but includes `new`, did you just forget to replace it? – Barmak Shemirani Sep 25 '21 at 08:20
  • "Convert ASCII to [Unicode]" is confused; ASCII is already a subset of Unicode. Can you please edit the question to explain in more detail what the code should do? Trivially, a pure-ASCII string `"hello"` corresponds to `"h\x00e\x00l\x00l\x00o\x00"` in UTF-16 (though the null bytes will obviously be problematic in regular C strings ... one of the many reasons to prefer https://utf8everywhere.org/) – tripleee Sep 25 '21 at 08:23
  • @BarmakShemirani Yeah, that and the casts; I was still converting from C++ when I realised that this all seemed way too convoluted. @tripleee Also, I'm working in C90, so I can't actually use UTF-8 everywhere… not to mention that I'm conforming to someone else's API [Adobe's] and need to accept regular `const char*` input in some places (which I then need to convert to the Unicode variant used by the API) – Samuel Marks Sep 25 '21 at 22:03

1 Answer


`message` includes the null terminating character, but `strlen` does not count this last character, so `cSize` has to increase by 1.

Usually you need to call `setlocale`, for example if `message` was typed on a non-English computer, but it's okay to skip it if `message` is guaranteed to be ASCII.
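A minimal sketch of that call (the string literal here assumes a UTF-8 locale):

#include <locale.h>
#include <stdlib.h>

int main(void)
{
    wchar_t out[32];

    /* Select the user's locale; the default "C" locale only
       guarantees correct mbstowcs conversion for ASCII. */
    setlocale(LC_ALL, "");
    mbstowcs(out, "caf\xC3\xA9", 32); /* UTF-8 bytes for "café" */
    return 0;
}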

Windows Example:

const wchar_t* to_wchar(const char* message)
{
    const size_t cSize = strlen(message) + 1; /* +1 for the null terminator */
    //wchar_t* w_str = new wchar_t[cSize]; using C++?
    wchar_t* w_str = malloc(cSize * sizeof(wchar_t));

    mbstowcs(w_str, message, cSize);
    /* or on Windows:
       size_t outSize;
       mbstowcs_s(&outSize, w_str, cSize, message, cSize); */

    return w_str;
}
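The returned buffer is heap-allocated, so the caller owns it. A usage sketch:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

const wchar_t* to_wchar(const char* message); /* the function above */

int main(void)
{
    const wchar_t* w = to_wchar("hello");
    wprintf(L"%ls\n", w);
    free((void*)w); /* caller owns the buffer; cast away const for free */
    return 0;
}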

Note that `wchar_t` is 2 bytes on Windows and 4 bytes on POSIX systems. UTF-16 also comes in two versions, little-endian and big-endian. UTF-16 uses 2 bytes per character for ASCII-equivalent codes, but 4 bytes for some characters from non-Latin languages.
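A quick way to check both properties on the current platform:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    unsigned int probe = 1;
    /* On a little-endian machine the first byte of the int 1 is 1. */
    int little_endian = *(unsigned char*)&probe;

    printf("sizeof(wchar_t): %u\n", (unsigned)sizeof(wchar_t)); /* 2 on Windows, 4 on POSIX */
    printf("byte order: %s-endian\n", little_endian ? "little" : "big");
    return 0;
}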

You should consider UTF-8 output instead. Most Windows programs are prepared to read UTF-8 from a file or the network.
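On Windows, for instance, a wide string can be converted to UTF-8 with `WideCharToMultiByte`; a minimal sketch with no error handling:

#include <windows.h>
#include <stdlib.h>

/* Sketch: convert a null-terminated wide string to a heap-allocated
   UTF-8 string; the caller frees the result. */
char* to_utf8(const wchar_t* w)
{
    int n = WideCharToMultiByte(CP_UTF8, 0, w, -1, NULL, 0, NULL, NULL);
    char* s = malloc(n);
    WideCharToMultiByte(CP_UTF8, 0, w, -1, s, n, NULL, NULL);
    return s;
}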

Windows byte output for "123" (hex):

31 00 32 00 33 00 00 00 <- little-endian
00 31 00 32 00 33 00 00 <- big-endian

Linux byte output from the above code (this won't be recognized as UTF-16 by Windows):

31 00 00 00 32 00 00 00 33 00 00 00 00 00 00 00
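Both dumps can be reproduced by printing each byte of the wide string:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t w[] = L"123";
    const unsigned char* p = (const unsigned char*)w;
    size_t i;

    /* Dump every byte, including the null terminator. */
    for (i = 0; i < sizeof w; i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}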

You can write your own function if you are 100% certain that the message is ASCII:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

typedef unsigned short prUTF16Char;//remove this line later

prUTF16Char* to_wchar(const char* message)
{
    if (!message) return NULL;

    size_t len = strlen(message);
    size_t bufsize = (len + 1) * 2;
    char* buf = malloc(bufsize);

    /* Detect endianness at runtime: on a little-endian machine the
       first byte of the int 1 is 1. */
    int little_endian = 1;
    little_endian = ((char*)&little_endian)[0];

    /* Zero the buffer, then put each ASCII byte in the low byte
       of its 2-byte UTF-16 code unit. */
    memset(buf, 0, bufsize);
    for (size_t i = 0; i < len; i++)
        buf[i * 2 + (little_endian ? 0 : 1)] = message[i];

    return (prUTF16Char*)buf;
}

prUTF16Char* wstr;
int main()
{
    wstr = to_wchar("ASCII");
    wprintf(L"%s\n", wstr); /* OK on Windows, where wchar_t is 16-bit */
    free(wstr);
    return 0;
}
Barmak Shemirani
  • Aside: Consider `w_str = malloc(cSize * sizeof(wchar_t));` --> `w_str = malloc(sizeof *w_str * cSize);`. Easier to code right, review and maintain. – chux - Reinstate Monica Sep 25 '21 at 13:17
  • Thanks, good catch (the off by 1 error). Hmm, wasn't aware of the whole endianness issue for string encoding… but yeah I do need UTF16 because of the really old API I need to conform to (Adobe's, which has the aforementioned `typedef` for UTF16). In terms of the provided solution, I believe that won't resolve the assignment to my global. Does that mean I have to make my global non-static, or is there another way? – Samuel Marks Sep 25 '21 at 22:00
  • I don't know what you mean about global non-static. You can assume Windows is little-endian (anything on x86/x64, especially old programs). The problem I mention is that when you use `mbstowcs` on different systems the output string could be wrong; otherwise the C compiler doesn't care about this. – Barmak Shemirani Sep 25 '21 at 22:46
  • Also, I am confused now about how your program works. It's fine if your program is running on Windows and talking to another Windows application. – Barmak Shemirani Sep 25 '21 at 22:47
  • `static const prUTF16Char* PROJECT_NAME_W = to_wchar(PROJECT_NAME);` is what I mean by the global static. (something which doesn't need to be `free`d by me). My project has two modes, 'plugin' and 'standalone'. 'plugin' requires the Adobe SDK and does things like output to the Adobe console. 'standalone' can work without Adobe's SDK—or even C++—and the example code is the API I'm trying to get working. `log_info(message)` should output `message` using stdout on 'standalone' and Adobe-specific API on 'plugin'. (this was working in C++; I'm rewriting in C) – Samuel Marks Sep 26 '21 at 00:12
  • Global variables are already persistent; they don't need `static`. You don't need `const` here either; note this is a pointer being allocated. You can add `static const` if you want. See the example shown for conversion without dependency on Windows. – Barmak Shemirani Sep 26 '21 at 05:07
  • Also note, in Windows you can use code pages to write Greek using `char`. That won't be ASCII. It's unlikely that Adobe decided to use UTF16 for no reason. – Barmak Shemirani Sep 26 '21 at 05:11
  • Thanks that is certainly useful. I've edited my question to include how I would handle a global with your setup. – Samuel Marks Sep 26 '21 at 20:01