Given:

wchar_t* str = L"wide chars";

How can I extract one character at a time in C (not C++)?

For example, I tried:

for (int i = 0; i < wcslen(str); i++) {
    printf("%wc\n", str[i]);
}

But it only gave me gibberish.

alk
ChaseTheSun
  • This is because `%wc` should be `%lc` ([demo](http://ideone.com/05ewa1)). Closing as a typo. – Sergey Kalinichenko Mar 07 '15 at 03:58
  • It still doesn't work if str = "日本語", for example. – ChaseTheSun Mar 07 '15 at 04:02
  • What OS are you using? – rici Mar 07 '15 at 04:22
  • In that case, my answer might help, or you might need something more Windows-specific. I don't know much about Windows, sorry. I'll add some tags. – rici Mar 07 '15 at 04:32
  • I do know that wchar_t in Windows can only hold codes which fit in the BMP; if you need other planes (i.e. codes greater than U+FFFF), then you'll end up with surrogate pairs, and individual characters from a surrogate pair are not meaningful. – rici Mar 07 '15 at 04:35
  • 日本語 characters are all in BMP. – Mark Tolonen Mar 07 '15 at 06:38
  • It isn't impossible to use Unicode in the console on Windows, but it generally seems to be more trouble than it's worth. Consider using the GUI instead, e.g., MessageBox. – Harry Johnston Mar 07 '15 at 09:15
  • Related: http://stackoverflow.com/q/3780378/694576 http://stackoverflow.com/q/1371012/694576 http://stackoverflow.com/q/18904081/694576 – alk Mar 07 '15 at 10:09

1 Answer

On Linux (Ubuntu), the following worked fine:

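#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
  /* Use the locale from the environment; see the explanation below. */
  setlocale(LC_ALL, "");
  const wchar_t* str = L"日本語";
  for (size_t i = 0; i < wcslen(str); i++) {
    /* %x expects unsigned int and %lc expects wint_t, so cast explicitly. */
    printf("U+%04x: %lc\n", (unsigned)str[i], (wint_t)str[i]);
  }
  return 0;
}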

The setlocale call is important. Without it, the program runs in the "C" locale, which provides no wide-character-to-multibyte conversion for characters like these, and the %lc format code needs that conversion. setlocale(LC_ALL, ""); sets the process's locale to the defaults defined by the locale environment variables (LANG, LC_ALL, LC_CTYPE and so on).

rici
  • Note that this won't work on Windows in the astral plane, because `wchar_t` on Windows is UTF-16. – Dietrich Epp Mar 07 '15 at 04:36
  • @DietrichEpp: Indeed. I wrote that answer before I knew that the target OS was Windows. Perhaps I should move my comment about surrogate pairs into the answer. TBH, I don't know if it will work on Windows even without surrogate pairs. – rici Mar 07 '15 at 04:37
  • Code above worked for me in Windows 7 if the source was saved in UTF-8 with BOM, except the characters didn't print because I'm not on Japanese Windows. The codepoints were correct. – Mark Tolonen Mar 07 '15 at 06:41