iterating through a char array with non standard chars

Question

Edit: I can only use stdio.h and stdlib.h

I would like to iterate through a char array filled with chars.

However, characters like ä and ö take up twice the space and use two elements. This is where my problem lies: I don't know how to access those special characters.

In my example the char "ä" would use hmm[0] and hmm[1].

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
  char* hmm = "äö";

  printf("%c\n", hmm[0]); //i want to print "ä"

  printf("%zu\n", strlen(hmm)); // strlen returns size_t, so use %zu

  return 0;
}

Thanks. I tried running my attached code in Eclipse, and there it works. I assume that is because it uses 64 bits, so the "ä" has enough space to fit. strlen confirms that each "ä" is counted as only one element. So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?

#include <stdio.h>
#include <stdlib.h>

int main()
{
 char* hmm = "äüö";

  printf("%c\n", hmm[0]);
  printf("%c\n", hmm[1]);
  printf("%c\n", hmm[2]);

  return 0;
}

You are looking for wide characters. See [this answer](http://stackoverflow.com/a/11287282/912144) for a little bit of explanation. `wchar_t` is your keyword for searching for more information. — Shahbaz, Dec 29 '12 at 16:41
If you do not have to use C, I'd use something else with nicer handling of character encodings and strings in general... — hyde, Dec 29 '12 at 16:53

benjarobin · Answer 1 · 2012-12-29T17:01:37.393

4

A char always uses one byte.

In your case you think that "ä" is one char, but it is not. Open your .c source file in a hexadecimal viewer and you will see that ä takes up 2 chars, because the file is encoded in UTF-8.

Now the question is: do you want to use wide characters?

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    const wchar_t hmm[] = L"äö";

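    /* Adopt the environment's locale so wprintf can convert to the terminal's encoding */
    setlocale(LC_ALL, "");
    wprintf(L"%ls\n", hmm);
    wprintf(L"%lc\n", hmm[0]);
    wprintf(L"%zu\n", wcslen(hmm));   /* wcslen returns size_t */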

    return 0;
}

edited Dec 29 '12 at 17:01

answered Dec 29 '12 at 16:43


Yes thats what i want to do. I need to be able to iterate through them. – Susan Dec 29 '12 at 16:47
I can only use stdio.h and stdlib.h, however with your example it also shows only "?". – Susan Dec 29 '12 at 16:56
I fixed the example : Without setlocale, the output of wprintf was in UTF-16 or something else (the console doesn't understand it) And why you can only use stdio.h and stdlib.h ? If so you cannot use wide character... – benjarobin Dec 29 '12 at 17:02

score 2 · Answer 2 · edited May 23 '17 at 12:04

Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:

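#include <stdio.h>
#include <stdlib.h>   /* mblen() */
#include <string.h>
#include <locale.h>

int main(void)
{
    const char *hmm = "äö";
    int off = 0;
    int len;
    int max = (int)strlen(hmm);

    /* Adopt the environment's locale so mblen() understands its encoding */
    setlocale(LC_ALL, "");

    printf("<<%s>>\n", hmm);
    printf("%zu\n", strlen(hmm));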

    while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
    {
        printf("<<%.*s>>\n", len, &hmm[off]);
        off += len;
    }

    return 0;
}

On my Mac, it produced:

<<äö>>
4
<<ä>>
<<ö>>

The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:

<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>

The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.

You can also use wide characters and wide-character printing, as shown in benjarobin's answer.

Thanks, your answer was very helpful and explained the problem really well. — Susan, Dec 29 '12 at 17:35

Chad · Accepted Answer · 2012-12-30T09:42:48.123

Sorry to drag this on, but I think it's important to highlight some issues. As I understand it, OS X can have UTF-8 as the default OS code page, so this answer is mostly about Windows, which uses UTF-16 under the hood and whose default ANSI code page (ACP) depends on the configured OS region.

First, you can open Character Map and find that ä and ö both reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).

The answer that uses setlocale( LC_ALL, "" ) does not give insight into why äö was rendered incorrectly in the Command Prompt window.

Command Prompt uses its own code pages, namely OEM code pages, each with its own character map.

Going into Command Prompt and typing the chcp command will reveal the current OEM code page that Command Prompt is using.

The Microsoft documentation details the following behavior for setlocale(LC_ALL, ""):

setlocale( LC_ALL, "" );
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.

You can do this manually by running chcp with your required code page and then running your application; it should then output the text correctly.

If it were a multibyte character set problem, there would be a whole list of other issues:

Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.

Given that the length was 4 instead of 2, I would say that the file has been saved as UTF-8. (It could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later.) You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each Unicode code point as two bytes. Your compiler opens the file and assumes it is in your default OS code page or ANSI C. When parsing your string, it interprets it as an ANSI C string, where 1 byte = 1 character.

To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support for the ASCII/MBCS stdio functions.

For Mac OS X, which has UTF-8 as its default OS code page, I would recommend following Jonathan Leffler's solution, because it is more elegant. If you port it to Windows later, though, you will need to convert the string from UTF-8 to UTF-16 using the example below.

With either solution you will still need to change the Command Prompt code page to your operating system's code page to print characters above ASCII correctly.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <windows.h>

// File saved as UTF-8, with characters outside the ASCII range
int main(void)
{
    // Set the C runtime locale to the user-default ANSI code page
    setlocale(LC_ALL, "");

    // äö lie outside the ASCII range, in the Latin-1 Supplement block of Unicode,
    // so each code point needs a lead byte plus one trailing byte when saved as UTF-8
    const char *hmm = "äö";

    printf("UTF-8 file string using Windows 1252 code page read as:%s\n", hmm);
    printf("Length:%zu\n", strlen(hmm));

    // Convert the UTF-8 string to a wide-character (UTF-16) string.
    // The first call returns the required length in WCHARs, terminator included.
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    if (nLen == 0)
        return 1;
    LPWSTR lpszW = (LPWSTR)malloc(nLen * sizeof(WCHAR));
    if (lpszW == NULL)
        return 1;
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);

    // Print it (%ls is the standard conversion for a wide string)
    wprintf(L"wprintf wide character of UTF-8 string: %ls\n", lpszW);

    // Free the memory
    free(lpszW);

    getchar();   // keep the console window open
    return 0;
}


UTF-8 file string using Windows 1252 code page read as:Ã¤Ã¶
Length:4
wprintf wide character of UTF-8 string: äö

Wow thanks, thats some great information. Learned a lot new stuff. :) — Susan, Dec 30 '12 at 22:59

score 0 · Answer 4 · answered Dec 29 '12 at 16:48


I would check your Command Prompt font and code page to make sure it can display your OS's single-byte encoding. Note that Command Prompt has its own code page, which differs from your text editor's.


Chad


Thanks, just ran it in eclipse, there it displays it correctly. – Susan Dec 29 '12 at 17:02
happy to hear :) good luck and post any other problems you have that we can help out with. – Chad Dec 29 '12 at 17:04
