
Xcode complains about "multi-character character constant" warnings when I try to do the following:

static unichar accentCharacters[] = { 'ā', 'á', 'ă', 'à' };

How do you make an array of characters when not all of them are ASCII? The following works just fine:

static unichar accent[] = { 'a', 'b', 'c' }; 

Workaround

The closest workaround I have found is to convert the special characters into hex, i.e. this works:

static unichar accentCharacters[] = { 0x0100, 0x0101, 0x0102 };
corydoras

3 Answers


It's not that Objective-C doesn't like it, it's that C doesn't. The constant 'c' is a char, which is 1 byte, not a unichar, which is 2 bytes. (See the note below for a bit more detail.)

There's no perfectly supported way to represent a unichar constant. You can use

char* s="ü";

in a UTF-8-encoded source file to get the Unicode C string, or

NSString* s=@"ü";

in a UTF-8 encoded source file to get an NSString. (This was not possible before 10.5. It's OK for iPhone.)

NSString itself is conceptually encoding-neutral; but if you want, you can get the individual UTF-16 code units (unichars) by using -characterAtIndex:.
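
For example, a minimal sketch of pulling a single UTF-16 code unit out of a literal (the string here is just illustrative):

NSString *s = @"ā";                    // saved in a UTF-8-encoded source file
unichar c = [s characterAtIndex:0];    // 0x0101, the UTF-16 code unit for 'ā'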

Finally two comments:

  • If you just want to remove accents from the string, you can just use a method like this, without writing the table yourself:

    -(NSString*)stringWithoutAccentsFromString:(NSString*)s
    {
        if (!s) return nil;
        NSMutableString *result = [NSMutableString stringWithString:s];
        // NSMutableString is toll-free bridged to CFMutableStringRef
        // (under ARC the cast needs to be (__bridge CFMutableStringRef)).
        CFStringFold((CFMutableStringRef)result, kCFCompareDiacriticInsensitive, NULL);
        return result;
    }
    

    See the documentation for CFStringFold.

  • If you want Unicode characters for localization/internationalization, you shouldn't embed the strings in the source code. Instead you should use Localizable.strings and NSLocalizedString (a minimal sketch follows this list). See here.
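
As a rough illustration of that second point (the key "greeting" and the comment are placeholders I made up, not anything from a real project):

// In code:
NSString *title = NSLocalizedString(@"greeting", @"Shown on the welcome screen");

// In en.lproj/Localizable.strings:
"greeting" = "Hello";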

Note: For arcane historical reasons, 'a' is an int in C (see the discussions here); in C++, it's a char. But that doesn't change the fact that writing more than one character inside '...' is implementation-defined and not recommended; see, for example, ISO C standard §6.4.4.4, paragraph 10. However, it was common in classic Mac OS to write four-character codes enclosed in single quotes, like 'APPL'. But that's another story...
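
A quick way to see both points, as a sketch in plain C (the value printed for 'APPL' is implementation-defined; 0x4150504C is simply what GCC and Clang happen to produce):

#include <stdio.h>

int main(void)
{
    printf("%zu %zu\n", sizeof('a'), sizeof((char)'a'));  // typically "4 1" in C; in C++ both would be 1
    printf("0x%08X\n", (unsigned)'APPL');                 // 0x4150504C here; this is exactly the kind of
                                                          // multi-character constant the warning is about
    return 0;
}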

Another complication is that accented letters are not always represented by one byte; it depends on the encoding (in UTF-8 they are not, in ISO-8859-1 they are), and unichar is a UTF-16 code unit. Did you save your source code in UTF-16? I think Xcode's default is UTF-8. GCC might do some encoding conversion depending on the setup, too...
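
To make the byte-count point concrete, a small sketch (assuming the source file is saved as UTF-8 and that Foundation plus <string.h> are imported):

const char *utf8 = "é";     // two bytes in UTF-8: 0xC3 0xA9
NSString *s = @"é";         // one UTF-16 code unit: 0x00E9
NSLog(@"%zu byte(s), %lu code unit(s)", strlen(utf8), (unsigned long)[s length]);
// -> 2 byte(s), 1 code unit(s)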

Yuji
  • Technically, literal characters `'a'` are of type `int` in C. – Chris Lutz Jan 28 '10 at 02:49
  • Doesn't solve my problem, but that's the closest to a correct answer I think I am going to get. (: – corydoras Jan 29 '10 at 01:12
  • About your workaround: Be very careful embedding Unicode characters in hex! If I remember correctly, unichar on OS X is UTF-16 in the platform endianness. So your code might not work on both PPC and Intel, and/or iPhone. If you only care about one platform that's fine, but you should keep that in mind in case Apple changes the CPU. – Yuji Jan 29 '10 at 03:22

Or you can just do it like this:

static unichar accentCharacters[] = { L'ā', L'á', L'ă', L'à' };

L is the standard C prefix for a wide character or wide string literal, which says "I'm about to write a Unicode character or string".

Works fine for Objective-C too.

Note: The compiler may give you a strange warning about too many characters put inside a unichar, but you can safely ignore that warning. Xcode just doesn't deal with the Unicode characters the right way, but the compiler parses them properly and the result is OK.

daniel.gindi
  • `L'a'` creates a `wchar_t` literal, which is not necessarily the same as a `unichar`. In particular, on OSX as of me writing this, the former is 32-bit and the latter is 16-bit. This may not be an issue in this particular case but is worth being aware of. – robbie_c Jun 13 '13 at 13:09
  • @robbie_c you are partially correct. L creates a "wide character string", but the nature of a "wide character" is undefined by the standard and varies between platforms and frameworks. On MS it's UTF-16, on Linux it's generally UTF-32; on iOS specifically, L matches unichar, so it's good to go :-) BTW, UTF-16 and UTF-32 are generally compatible in the 16-bit range. – daniel.gindi Jun 13 '13 at 13:22
  • on iOS: `NSLog(@"%zd %zd %zd %zd", sizeof(wchar_t), sizeof(unichar), sizeof(L"é"[0]), sizeof("e"[0]));` `4 2 4 1` – robbie_c Jun 13 '13 at 13:58
  • Problem is, "e" is not unicode. A unichar can accept a wchar_t and be fine with it. That's the nature of UTF. – daniel.gindi Jun 13 '13 at 17:56

Depending on your circumstances, this may be a tidy way to do it:

NSCharacterSet* accents = 
    [NSCharacterSet characterSetWithCharactersInString:@"āáăà"];

And then, if you want to check if a given unichar is one of those accent characters:

if ([accents characterIsMember:someOtherUnichar])
{
    // someOtherUnichar is one of the accented characters
}

NSString also has many methods of its own for handling NSCharacterSet objects.
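
For instance, -rangeOfCharacterFromSet: finds the first member of the set in a string (the sample string here is just for illustration):

NSRange r = [@"Māori" rangeOfCharacterFromSet:accents];
if (r.location != NSNotFound)
{
    // "Māori" contains 'ā', so this branch runs; r.location == 1
}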

Matt Comi