I'm passing GCC a UTF-32 string and it's complaining about an invalid multibyte or wide character.

I tested this in Clang, and I got the same error message.

I wrote the statement originally with MSVC, and it worked alright.

Here's the assert statement.

 assert(utf_string_copy_utf32(&string, U"¿Cómo estás?") == 0);

Here's the declaration.

int utf_string_copy_utf32(struct utf_string * a, const char32_t * b);
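
Roughly, the function behaves like this (a simplified sketch with a guessed struct layout, not the actual library code):

#include <stdlib.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical layout for illustration; the real struct utf_string may differ. */
struct utf_string {
    char32_t * data; /* UTF-32 code units, null-terminated */
    size_t size;     /* length in code units */
};

int utf_string_copy_utf32(struct utf_string * a, const char32_t * b){
    size_t len = 0;
    while (b[len])
        len++;
    char32_t * tmp = malloc((len + 1) * sizeof(char32_t));
    if (tmp == NULL)
        return -1;
    memcpy(tmp, b, (len + 1) * sizeof(char32_t));
    free(a->data); /* assumes a->data is NULL or heap-allocated */
    a->data = tmp;
    a->size = len;
    return 0; /* zero on success, which is what the assert checks */
}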

Here's the compile command:

cc -Wall -Wextra -Werror -Wfatal-errors -g -I ../include -fexec-charset=UTF-32 string-test.c libutf.a -o string-test

Am I to assume that GCC can only recognize Unicode characters written as escape sequences?

Or am I misunderstanding how GCC and Clang recognize these characters?
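
By escape sequences I mean universal character names, which sidestep the source encoding question entirely; the same literal spelled that way would be:

 assert(utf_string_copy_utf32(&string, U"\u00BFC\u00F3mo est\u00E1s?") == 0);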

Edit 1

Here's the error message.

string-test.c: In function ‘test_copy’:
string-test.c:46:61: error: converting to execution character set: Invalid or incomplete multibyte or wide character
assert(utf_string_copy_utf32(&string, U"�C�mo est�s?") == 0);

Edit 2

I'm even more confused now that I've tried to recreate the bug in a smaller example.

#include <uchar.h>
#include <stdlib.h>
#include <stdio.h>

/* Counts char (byte) units up to the null terminator. */
static size_t test_utf8(const char * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

/* Counts char32_t units up to the null terminator. */
static size_t test_utf32(const char32_t * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

int main(void){
    size_t len;

    len = test_utf8(u8"¿Cómo estás?");
    printf("utf-8 length: %zu\n", len);

    len = test_utf32(U"¿Cómo estás?");
    printf("utf-32 length: %zu\n", len);

    return 0;
}

This prints:

utf-8 length: 15
utf-32 length: 12

This reaffirms the way I originally thought it worked: the string has 12 characters, and the three non-ASCII ones (¿, ó, á) each take two bytes in UTF-8, which accounts for the 15 bytes.

So I guess that means there's a problem somewhere in the library code that I'm using. But I still have no idea what's going on.

  • Possibly relevant: http://stackoverflow.com/questions/3768363/character-sets-not-clear – hyde Feb 26 '17 at 18:04
  • Your question seems to have morphed from `string-test.c:46:61: error: converting to execution character` to why is the length different -- what is your actual question? – Soren Feb 26 '17 at 18:26
  • My question was why was I getting the error message. I figured it out. I'm writing the answer now. – tay10r Feb 26 '17 at 18:28
  • You probably want to use an editor that can write UTF-8 source, if you're going to use `u8"literals"` – M.M Feb 26 '17 at 19:06
  • @M.M it doesn't matter whether or not a `u8""` string is encoded as UTF-8 in the source code. It only matters that the source code uses the same encoding throughout the file and that the compiler knows what encoding to expect. For instance, GCC actually does support the Windows-1252 character set; it just needs to be specified on the command line (see the example below). – tay10r Feb 26 '17 at 20:22
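
For example (a command-line sketch; -finput-charset is a real GCC option, but the exact charset name accepted depends on your iconv installation):

cc -Wall -Wextra -Werror -Wfatal-errors -g -I ../include -finput-charset=WINDOWS-1252 string-test.c libutf.a -o string-test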

1 Answer

I figured out the issue.

I did a hex dump of both string literals (the one that was breaking in the original code and the one that was working).

Here's the broken string literal (I wrote this on Windows):

00000000: 5522 bf43 f36d 6f20 6573 74e1 733f 220a  U".C.mo est.s?".

Here's the working string literal (I wrote this on an Ubuntu machine):

00000000: 5522 c2bf 43c3 b36d 6f20 6573 74c3 a173  U"..C..mo est..s
00000010: 3f22 0a                                  ?".

Although they look exactly the same in a code editor, and even though they both have a U prefix, they are encoded differently in the source file: in the broken one, ¿, ó, and á are single bytes (bf, f3, e1), while in the working one each is a two-byte sequence (c2bf, c3b3, c3a1).

While I'm not quite sure which encoding is which, the takeaway is that the encoding of the source file that contains the literal matters a great deal.
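
A quick way to see what the compiler actually produced, without a hex dump tool, is a throwaway program (just a sketch, not part of the library) that prints each byte of a narrow literal:

#include <stdio.h>

int main(void){
    /* With a UTF-8 source file and default settings, ¿ should print as c2 bf. */
    const unsigned char * p = (const unsigned char *)"¿Cómo estás?";
    while (*p)
        printf("%02x ", *p++);
    putchar('\n');
    return 0;
}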

Edit 1

As @melpomene pointed out in the comments:

The broken encoding is Windows-1252.

The working encoding is UTF-8.

  • The broken one is in [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252); the working one is in [UTF-8](https://en.wikipedia.org/wiki/UTF-8). – melpomene Feb 26 '17 at 18:47
  • @melpomene Thanks! – tay10r Feb 26 '17 at 18:50
  • It's not broken, you just have to tell the compiler what the source encoding is. I believe it is with `-finput-charset`. – Mark Tolonen Feb 27 '17 at 00:13
  • I strongly recommend using UTF-8 wherever possible. Modern toolchains should not barf on a BOM either, but you could try putting one in a comment if something does. – Davislor Feb 27 '17 at 06:48