Index a character in a wchar_t array

Question

The thread "Size of wchar_t* for surrogate pair" shows that the memory needed to store a character as wchar_t values can vary, because some characters take more space to encode (a surrogate pair). That leads me to the following question: how do I then navigate along an array of wchar_t values? I can no longer just increment or decrement the current address by the fixed size of a wchar_t.

CORRECTION: By "How do I then navigate along an array of wchar_t values" I meant how to navigate between code points, each of which may be represented by a variable number of wchar_t values.

@Sam I think you need to go back and reread the answers to that question. The point was that `wchar_t` and `wchar_t*` are different sizes. The size required to store a `wchar_t` will be consistent within your program. — Degustaf, Aug 19 '14 at 20:19
Perhaps you misread the question you linked; that is only talking about `sizeof(wchar_t)` and `sizeof(wchar_t *)`. It does not talk about how many characters are used to encode surrogate pairs (the OP thought it did but he was mistaken). Can you clarify your question? — M.M, Aug 19 '14 at 20:23
I think the question is about, for example, the [Han character](http://www.fileformat.info/info/unicode/char/2008a/index.htm) that takes 2 wchar_t values. So `ptr++` and `ptr--` do not work here, as each moves by only 1 wchar_t. How should one navigate instead? — wimh, Aug 19 '14 at 21:00

score 4 · Answer 1 · answered Aug 19 '14 at 20:14

Don't use wchar_t to perform manipulations on Unicode strings. Seriously, just don't. As you've already observed, there isn't a one-to-one correspondence between wchar_t objects and Unicode code points. Use a library such as ICU to manipulate Unicode text.

Brian Bi

+1 for providing practical advice for a situation that is difficult to get right on your own – M.M Aug 19 '14 at 20:26

score 3 · Answer 2 · answered Aug 19 '14 at 20:23

There are multiple issues here and using a library such as ICU will help you avoid a lot of problems. The issue with surrogate characters in UTF-16 is not the only problem if you are trying to count "characters".

If you just have to walk a wchar_t string holding UTF-16, surrogates are uniquely defined: a leading value (0xD800 to 0xDBFF) is followed by a trailing value (0xDC00 to 0xDFFF). You can use this knowledge to walk forward or backward through an array counting the "characters". This assumes the array holds a valid sequence of values.
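As a rough illustration of that walk, here is a minimal C++ sketch. It assumes the `wchar_t` array holds well-formed UTF-16 code units (as is typical on Windows; with a 32-bit `wchar_t` surrogates normally do not appear at all). The helper names `is_lead_surrogate`, `is_trail_surrogate`, `next_code_point`, `prev_code_point` and `count_code_points` are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the array holds well-formed UTF-16 code units.
// All helper names here are illustrative only.
inline bool is_lead_surrogate(wchar_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate(wchar_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Step forward from the start of one code point to the start of the next.
const wchar_t* next_code_point(const wchar_t* p, const wchar_t* end) {
    assert(p < end);
    if (is_lead_surrogate(*p) && end - p >= 2 && is_trail_surrogate(p[1]))
        return p + 2;   // surrogate pair: two code units
    return p + 1;       // single code unit (or a stray surrogate)
}

// Step backward from the start of one code point to the start of the previous one.
const wchar_t* prev_code_point(const wchar_t* p, const wchar_t* begin) {
    assert(p > begin);
    --p;
    if (is_trail_surrogate(*p) && p > begin && is_lead_surrogate(p[-1]))
        --p;            // we landed on the second half of a pair
    return p;
}

// Count code points in the range [begin, end).
std::size_t count_code_points(const wchar_t* begin, const wchar_t* end) {
    std::size_t n = 0;
    for (const wchar_t* p = begin; p < end; p = next_code_point(p, end))
        ++n;
    return n;
}
```

For the four code units `{0x0041, 0xD840, 0xDC8A, 0x0042}` (the letter A, U+2008A as a surrogate pair, the letter B), `count_code_points` reports 3. Stray surrogates are stepped over one unit at a time rather than rejected, which is a design choice of this sketch, not a validation strategy.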

Another issue is values in the stream that aren't characters by themselves. For example, U+0301 is COMBINING ACUTE ACCENT, which adds an accent to the previous value. These can be an issue whether you use UTF-8, UTF-16, or UTF-32.
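To make the combining-mark problem concrete, here is a deliberately simplified C++ sketch over UTF-32, where every code point already has a fixed width. It treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is an assumption made for brevity; real text segmentation follows the Unicode rules in UAX #29 and is best left to a library such as ICU. The names `is_combining_diacritic` and `count_base_characters` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Simplification: only U+0300..U+036F counts as combining here.
// Real grapheme segmentation (UAX #29) covers far more cases.
inline bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count the code points in a UTF-32 buffer that are not combining
// diacritics, i.e. those that start a new visible character under
// the simplification above.
std::size_t count_base_characters(const char32_t* s, std::size_t len) {
    assert(s != nullptr || len == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (!is_combining_diacritic(s[i]))
            ++n;
    return n;
}
```

For the three code points `e`, U+0301, `x` this counts 2, even though no surrogates are involved.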

score 0 · Answer 3 · PeterSW · edited Apr 02 '15 at 17:26

This answer clarifies the nature of wchar_t as a type, which seemed to have been misunderstood before the "CORRECTION" was added to the question.

As with any concrete type, sizeof(wchar_t) is constant for a particular system as is sizeof(wchar_t *).

In language terms you can navigate an array of wchar_t just the same as you can navigate an array of any other type.
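A small C++ sketch of that point, with hypothetical names (`wide_unit_size`, `advance_units`): pointer arithmetic on `wchar_t*` is scaled by the fixed `sizeof(wchar_t)`, exactly as for any other element type. Note that this moves by code units, not by code points.

```cpp
#include <cassert>
#include <cstddef>

// Fixed at compile time for a given platform
// (commonly 2 on Windows and 4 on Linux).
constexpr std::size_t wide_unit_size = sizeof(wchar_t);

// Return a pointer to the code unit `offset` elements away from `p`.
// This is ordinary pointer arithmetic: the compiler scales the offset
// by sizeof(wchar_t).
const wchar_t* advance_units(const wchar_t* p, std::ptrdiff_t offset) {
    assert(p != nullptr);
    return p + offset;
}
```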

However, working with text characters encoded with varying numbers of wchar_ts is another and much more complex matter. The other answers have to some extent addressed that.

answered Aug 19 '14 at 20:12

You are right. I corrected my question, which was easy to misread. – Sam Aug 19 '14 at 21:23
I'm not sure why I have been getting downvotes here. I've attempted to clarify my answer. I would appreciate any feedback on why this answer deserves downvotes. – PeterSW Apr 02 '15 at 17:29
I also have no idea why someone downvoted your answer. I gave you an upvote to neutralize the downvote =). – Sam Apr 03 '15 at 19:26

score 0 · Accepted Answer · edited May 23 '17 at 12:06

The size of wchar_t may differ between systems, but on a given machine it is fixed, both at compile time and at run time.

You can retrieve its size with the sizeof operator, and you can iterate over a wchar_t array the same way as over an array of any other type.

For a specific locale, wchar_t is large enough to store any character of that locale. So the mapping between the string's code units and the text's characters is one-to-one, and you need not worry about iterating over the characters of a wide string like any other array to read the next or previous character (unlike Unicode).

However, this is the only bright side of wchar_t strings. Using them as a general way to store arbitrary strings is not an easy task, so you should use Unicode-aware tools. A related Q&A is here.

I think the answer of M.M was helpful to me. Yes, my question may not have been very clear, as the whole topic of encodings is quite confusing. — Sam, Aug 19 '14 at 21:12
