I have a C program that currently reads in Chinese text and stores it as type wchar_t. What I want to do is look for a specific character in the text, but I am not sure how to refer to that character in the code.

I essentially want to say:

wchar_t character;

if (character == 个) {
    return 1;
}

else return 0;

Some logic has been omitted, obviously. How would I go about performing such logic on Chinese in C?

Edit: Got it to work. This code compiles with -std=c99, and prints out the character "个".

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {
        wchar_t test[] = L"\u4E2A";
        setlocale(LC_ALL, "");
        printf("%ls", test);
}
Alex Hansen
  • Each character has a unique code in the encoding used, so you need to provide that code. For example, in ASCII `if (character == '3')` and `if (character == 51)` are equivalent because `51` is the decimal ASCII code for the character `'3'`. – Iharob Al Asimi Apr 19 '15 at 01:05
  • [An edit](http://stackoverflow.com/revisions/29724599/3) has already pointed out the **=** vs **==** difference, which you should apply in pseudocode. Additionally: pay attention to consistency on your return values. If `false` is available and you are [using stdbool.h](http://stackoverflow.com/questions/4767923/c99-boolean-data-type) then tag your question [c99](http://stackoverflow.com/questions/tagged/c99)...either do 0/1 or false/true, a mix just confuses the *[(already very confusing)](http://www.joelonsoftware.com/articles/Unicode.html)* landscape of unicode further...! – HostileFork says dont trust SE Apr 19 '15 at 01:07
  • Thanks, I fixed the return inconsistency. That was my fault of being lazy on the pseudocode and switching back and forth from c++. I will look at the unicode options now. – Alex Hansen Apr 19 '15 at 01:17

2 Answers

If your compiler accepts source files in a supported Unicode encoding, you can compare against the actual symbol directly. Otherwise, you can write the wide character constant with a \u escape:

#include <stdio.h>
#include <wchar.h>  /* wchar_t */

int main()
{
    int i;
    wchar_t chinese[] = L"我不是中国人。";
    for(i = 0; chinese[i]; ++i)
    {
        if(chinese[i] == L'不')
            printf("found\n");
        if(chinese[i] == L'\u4E0D')
            printf("also found\n");
    }
}

Note a wide character string is L"xxx" while a wide character is L'x'. A Unicode BMP code point can be specified with \uXXXX.

FYI, I compiled with Visual Studio 2012 with source encodings of UTF-8 with BOM, UTF-16 (little endian) and UTF-16 (big endian). UTF-8 without BOM did not work.

Mark Tolonen
  • This method worked. I had to tweak it a bit, as I am writing in c and not c++. I had to add the "-std=c99" compiler flag and use "\uxxxx" instead of '\uxxxx', but I got it to work, thanks. – Alex Hansen Apr 19 '15 at 01:39
  • 1
    'Tis a little curious that an answer in C++ is accepted for a question about C. However, the character handling (as opposed to the loop and I/O functions) are basically the same in both. – Jonathan Leffler Apr 19 '15 at 01:40
  • @AlexHansen: I suggest putting the 'working code' into an edit in the question. Code in comments is not readily readable. – Jonathan Leffler Apr 19 '15 at 01:41
  • @JonathanLeffler As the unicode method worked (with some added compiler flags and different type of quotes), it was satisfactory to me. And yes, I will edit that now. – Alex Hansen Apr 19 '15 at 01:41
  • Oops! Sorry about C++ vs. C. Wasn't paying attention and usually work in C++. I'll update with C. It still works in VS2012 with `\uxxxx`. – Mark Tolonen Apr 19 '15 at 01:54
  • UTF-8 without BOM requires `/utf-8` flag for MSVC 2019 and older – Dúthomhas Oct 16 '22 at 01:38

Thanks to the explanations above, I am now able to write the following code, which works on an M1 Mac (macOS Monterey):

// To run this program:
// $ gcc -o test test_Chinese.c
// $ ./test

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int len(wchar_t *str) {
    int i = 0;  // must start at 0; reading an uninitialized i is undefined behavior
    while (str[i]) { i++; }
    printf("i=%d\n", i);
    return i;
}
int main() {
    setlocale(LC_CTYPE, ""); // need this for wprintf()
    wchar_t str[] = L"國民、国民。평화、平和。";
    for(int i = 0; str[i]; ++i) {
        if(str[i] == L'民')       // note: single quote for one character
            printf("found at %d; ", i);
        if(str[i] == L'\u6C11')  // same character by code point: U+6C11 is L'民'
            printf("also found at %d\n", i);
    }
    wchar_t star1 = 0x2606;    // or L'\u2606'
    wchar_t star2 = L'\u2605'; // or 0x2605;
    wprintf(L"White Star: %lc\n", star1); // U+2606; here using printf() will have no output
    wprintf(L"Black Star: %lc\n", star2); // U+2605
    wprintf(L"multi-lingual string: '%ls'\n", str);
    printf("length of 多國語文str: %d; str[2]: '%lc'\n", len(str), str[2]);
    // next line leads to errors
    //wprintf("length of 多國語文str: %d、str[2]:%lc\n", len(str), str[2]);
}

But I haven't figured out when to use wprintf() or printf().

Sam Tseng