3

The thread "Size of wchar_t* for surrogate pair" shows that the memory required to store a character as wchar_t values may vary, because some characters need more space to encode (a surrogate pair). That leads me to the following question: how do I then navigate along an array of wchar_t values? I can no longer just increment or decrement the current address by the fixed size of a wchar_t.

CORRECTION: By "How do I then navigate along an array of wchar_t values" I meant how to navigate between code points, each of which may be represented by a variable number of wchar_t values.

Sam
  • Sorry: is this C or C++ ? – Marco A. Aug 19 '14 at 20:10
  • 2
    @Sam I think you need to go back and reread the answers to that question. The point was that `wchar_t` and `wchar_t*` are different sizes. The size required to store a `wchar_t` will be consistent within your program. – Degustaf Aug 19 '14 at 20:19
  • Perhaps you misread the question you linked; that is only talking about `sizeof(wchar_t)` and `sizeof(wchar_t *)`. It does not talk about how many characters are used to encode surrogate pairs (the OP thought it did but he was mistaken). Can you clarify your question? – M.M Aug 19 '14 at 20:23
  • I think the question is for example the [Han character](http://www.fileformat.info/info/unicode/char/2008a/index.htm) which is using 2 wchar_t characters. So `ptr++` and `ptr--` do not work here, as it only moves 1 wchar_t. How to navigate instead? – wimh Aug 19 '14 at 21:00
  • Wimmel nailed it. Exactly what I meant – Sam Aug 19 '14 at 21:14

4 Answers

4

Don't use wchar_t to perform manipulations on Unicode strings. Seriously, just don't. As you've already observed, there isn't a one-to-one correspondence between wchar_t objects and Unicode code points. Use a library such as ICU to manipulate Unicode text.

Brian Bi
  • 1
    +1 for providing practical advice to a situation that is difficult to get right on your own – M.M Aug 19 '14 at 20:26
3

There are multiple issues here and using a library such as ICU will help you avoid a lot of problems. The issue with surrogate characters in UTF-16 is not the only problem if you are trying to count "characters".

If you just have to walk a wchar_t string that holds UTF-16, surrogates are unambiguous: a pair is a leading value (0xD800 to 0xDBFF) followed by a trailing value (0xDC00 to 0xDFFF). You can use this knowledge to walk forward or backward through an array, counting the "characters". This assumes the array holds a valid sequence of values.
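
The surrogate ranges above can be turned into a small forward/backward walk. This is only a sketch, assuming the array holds well-formed UTF-16 code units; the helper names (`next_code_point`, `prev_code_point`, `code_point_at`, `count_code_points`) are made up for illustration and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers; the names are illustrative, not from any library.
// Assumption: s[0..len) holds well-formed UTF-16 code units.

inline bool is_lead_surrogate(wchar_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_trail_surrogate(wchar_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// True if a surrogate pair starts at index i.
inline bool pair_starts_at(const wchar_t* s, std::size_t len, std::size_t i) {
    return i + 1 < len && is_lead_surrogate(s[i]) && is_trail_surrogate(s[i + 1]);
}

// Index where the code point after the one at index i starts.
std::size_t next_code_point(const wchar_t* s, std::size_t len, std::size_t i) {
    assert(i < len);
    return pair_starts_at(s, len, i) ? i + 2 : i + 1;
}

// Index where the code point before index i starts (requires i > 0).
std::size_t prev_code_point(const wchar_t* s, std::size_t i) {
    assert(i > 0);
    if (i >= 2 && is_trail_surrogate(s[i - 1]) && is_lead_surrogate(s[i - 2]))
        return i - 2;
    return i - 1;
}

// Decode the code point that starts at index i.
char32_t code_point_at(const wchar_t* s, std::size_t len, std::size_t i) {
    assert(i < len);
    if (pair_starts_at(s, len, i)) {
        char32_t hi = static_cast<char32_t>(s[i]) - 0xD800;
        char32_t lo = static_cast<char32_t>(s[i + 1]) - 0xDC00;
        return 0x10000 + (hi << 10) + lo;
    }
    return static_cast<char32_t>(s[i]);
}

// Count code points by walking forward from the start.
std::size_t count_code_points(const wchar_t* s, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; i = next_code_point(s, len, i)) ++n;
    return n;
}
```

For the Han character U+2008A mentioned in the comments, the two code units are 0xD840 and 0xDC8A, so stepping forward from the leading value skips two array elements instead of one.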

Another issue is values in the stream that aren't a character by themselves. For example, U+0301 is COMBINING ACUTE ACCENT, which adds an accent to the previous character. These can be an issue whether you are using UTF-8, UTF-16, or UTF-32.
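
To illustrate why code points still are not "characters", here is a deliberately simplified sketch. It assumes already-decoded code points and treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining; real text segmentation has many more rules, which is one reason a library such as ICU is recommended. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Gross simplification: only the Combining Diacritical Marks block is
// recognized. Real grapheme segmentation covers far more cases.
inline bool is_combining_mark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Count code points that start a new "character", folding combining marks
// into the code point that precedes them.
std::size_t count_base_characters(const char32_t* cps, std::size_t len) {
    assert(cps != nullptr || len == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // A mark at index 0 has nothing to attach to, so it counts on its own.
        if (i == 0 || !is_combining_mark(cps[i])) ++n;
    }
    return n;
}
```

With the sequence U+0065 U+0301 U+0078 ("e" plus a combining acute accent, then "x") there are three code points but only two characters as a reader would see them.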

Brian Walker
0

This answer clarifies the nature of wchar_t as a type. That seemed to have been misunderstood before the "CORRECTION" was added to the question.

As with any concrete type, sizeof(wchar_t) is constant for a particular system as is sizeof(wchar_t *).

In language terms, you can navigate an array of wchar_t just as you can navigate an array of any other type.
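
As a small sketch of that point: incrementing a `wchar_t*` always advances by exactly one element, that is, by `sizeof(wchar_t)` bytes, whatever value the element holds. The two helper functions below are hypothetical and exist only to make the distances visible.

```cpp
#include <cassert>
#include <cstddef>

// Number of wchar_t elements between two pointers into the same array.
std::ptrdiff_t elements_between(const wchar_t* first, const wchar_t* last) {
    return last - first;
}

// Number of bytes between the same two pointers.
std::size_t bytes_between(const wchar_t* first, const wchar_t* last) {
    assert(first <= last);
    return static_cast<std::size_t>(reinterpret_cast<const char*>(last) -
                                    reinterpret_cast<const char*>(first));
}
```

After `const wchar_t* p = a; ++p;` on an array `a`, `elements_between(a, p)` is 1 and `bytes_between(a, p)` equals `sizeof(wchar_t)` on every platform, even though the value of `sizeof(wchar_t)` itself differs between platforms.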

However, working with text characters encoded with varying numbers of wchar_ts is another and much more complex matter. The other answers have to some extent addressed that.

PeterSW
  • You are right. I did correct my question which was mistakable. – Sam Aug 19 '14 at 21:23
  • I'm not sure why I have been getting down votes here. I've attempted to clarify my answer. I would appreciate any feedback as to why this answer is deserving of down votes? – PeterSW Apr 02 '15 at 17:29
  • I have also no idea why someone did down vote your answer. I gave you an up vote to neutralize the down vote =). – Sam Apr 03 '15 at 19:26
0

The size of wchar_t may differ between systems, but on any given system it is fixed at compile time and at run time.

You can retrieve its size with the sizeof operator, and you can iterate over an array of wchar_t the same way as over an array of any other type.

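In principle, wchar_t is meant to be wide enough to hold any single character of the supported locales. Where that holds, the mapping between the string's code units and the text's characters is one-to-one, so you can iterate over a wide string like any other array to read the next or previous character (unlike variable-width Unicode encodings). It does not hold where wchar_t is 16 bits and stores UTF-16, as on Windows; there a character outside the Basic Multilingual Plane takes two wchar_t values.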

However, this is the only bright side of wchar_t strings. Using them as a general way to store arbitrary strings is not an easy task, so you should use Unicode-aware tools. A related Q&A is here.

masoud
  • 1
    I think the answer of M M was helpful to me. Yes, my question may not have been clear at all, as the entire topic of encodings is quite confusing. – Sam Aug 19 '14 at 21:12