10

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...

C naively defined all of its character test functions for an int:

int isspace(int ch);

But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings.**

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you write isspace(*pchar) you can end up with sign-extension problems. They're hard to see, and hence they're hard to guard against, in my experience.
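A minimal illustration of the trap (the 0xA0 byte is just an example value, and this assumes a platform where plain char is signed):

#include <cctype>
#include <cstdio>

int main()
{
    char ch = static_cast<char>(0xA0);  // e.g. the Latin-1 non-breaking-space byte
    // Where plain char is signed, ch holds -96, so std::isspace(ch) would receive
    // a negative value other than EOF: undefined behavior per the standard.
    std::printf("sign-extended: %d\n", static_cast<int>(ch));            // likely -96
    std::printf("zero-extended: %d\n", static_cast<unsigned char>(ch));  // 160
    // The safe call is std::isspace(static_cast<unsigned char>(ch)).
}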

Similarly, because isspace() and its ilk all take ints, and because the actual width of a character is often unknown without analyzing the string - meaning that any modern character library should essentially never be carting around char or wchar_t values but only pointers/iterators, since only by analyzing the character stream can you know how much of it makes up a single logical character - I am at a bit of a loss as to how best to approach the issue.

I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...


** These issues don't come up for fixed-size character encodings that can wholly contain a full character - UTF-32 is apparently about the only option with that property (or specialized environments that restrict themselves to ASCII or some such).


So, my question is:

"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign expansion, and
2) variable-width character issues

After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.

Please note:

No matter what size your char_type is, it's wrong for most character encoding schemes.

This problem exists in the standard C library as well as in the C++ standard library, both of which still pass around char and wchar_t, rather than string iterators, in the various isspace, isprint, etc. implementations.

Actually, it's precisely those types of functions that break the genericity of std::string. If it only worked in storage units, and didn't pretend to understand the meaning of the storage units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...

Thank You

Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

Mordachai
  • 1. Why do you care? 2. Functions from `ctype.h` are not meant for wide characters, those are in `wctype.h`. As for variable width Unicode characters, AFAIK the standard C library has no support for them. You may need to use a library such as ICU for determining traits of such characters. Also, chars are not always 8-bit wide. There are several popular platforms with 16-bit chars. You can determine char size by inspecting the `CHAR_BIT` preprocessor symbol in `limits.h`. – Praetorian Nov 10 '11 at 16:50
  • wchar_t is 16 bits (unsigned I believe), but all flavors of Unicode encodings are multi-byte - i.e. variable length for each character. So more cases fit in 16 bits, many don't - some don't even fit in 32 bits - so no matter what size of character_type you choose, it's bound to be wrong sometimes. – Mordachai Nov 10 '11 at 16:55
  • As to why care? Because it actually comes up to bite me in International software. I'm debugging an issue right now that comes down to sign expansion of multi-width characters for our Japanese distributor. Everyone should care, because this is a fundamental failing in every string-library I've personally worked with - and most developers don't even realize that the libraries are insufficient, and their code abounds with problems because of the inadequate thinking surrounding this problem. – Mordachai Nov 10 '11 at 16:58
  • You keep speaking in absolutes about things that are not specified by the standard. `wchar_t` ***is not always*** 16-bit, it is implementation defined. The same is true for its signedness. This applies to `char`s too. And if you're serious about internationalization of your software you should be using a Unicode aware library to handle strings, not the standard C library. The latter is incapable of handling things like surrogate pairs for instance, with any type of Unicode encoding. – Praetorian Nov 10 '11 at 17:04
  • I am serious about it - so - what's this "unicode aware library" of which you speak? (Also, it is a total cop-out that C/C++ just don't define any of this, effectively pushing back this mess on us programmers - almost giving us tools that work, but not quite - at least not for any Unicode encoding I know of - which is surely the de facto standard that we've all agreed upon in 98% of the computing world, no?) – Mordachai Nov 10 '11 at 17:08
  • Yes, strings are a mess in C & C++ (and probably all other programming languages). [ICU](http://site.icu-project.org/) is a popular Unicode aware library; I've never used it myself, so I can't vouch for how good / bad it actually is. – Praetorian Nov 10 '11 at 17:10
  • @MooingDuck thanks for the shared pain! ;) So in theory I could internally encode every string as 38 bit unsigned characters and be sure that I could pass them around unmolested. But I'm thinking this is inefficient, as well as missing any library to support such a thing (and I'd still have to do conversions on all I/O for Windows APIs, and more general file & stream I/O where this new encoding must be converted. I really would like to have a library that fully supports multibyte Unicode in multiple encodings: UTF-7, UTF-8, UTF-16 and SHIFT-JIS at a minimum, which itself avoids these issues – Mordachai Nov 10 '11 at 17:23
  • @Praetorian I am curious about those popular platforms with 16-bit chars. Could you give a reference? – rodrigo Nov 10 '11 at 17:23
  • @Mordachai: I typed the wrong number, because the real one looks too small. [Unicode is limited to 0x10FFFF](http://en.wikipedia.org/wiki/UTF-32), so you only need 20 bits. – Mooing Duck Nov 10 '11 at 17:28
  • `char` may not be signed. `int` always is. And `char` may not be 8 bits wide. – Lightness Races in Orbit Nov 10 '11 at 17:28
  • @rodrigo TI DSPs usually have 16-bit chars. Also, I think Blackfin DSPs from Analog Devices have 32-bit chars! – Praetorian Nov 10 '11 at 17:29
  • @MooingDuck: that's a few FFs toomany - unicode codepoints are in range 0x0..0x10FFFF, ie you need 21 bits to represent the whole range; you're right that UTF-32 is a fixed-width character coding - however, in most cases you're actually interested in grapheme clusters (ie user-perceived characters), and you'll have to treat UTF-32 as variable-length – Christoph Nov 10 '11 at 17:32
  • @rodrigo: I've seen several microchips with 16-bit chars, but can't find one now. The PDP-6 and PDP-10 had 36-bit bytes apperently. – Mooing Duck Nov 10 '11 at 17:35
  • @TomalakGeret'kal: True, at the language level (or compilation level). But in terms of brass tacks, for whatever platform you're writing software, there is a necessary encoding (e.g. just to use most of the std::string library, you're forced to use some representation - probably one that is convenient to the software you write). For us, or anyone working with Windows desktop apps, we need encodings that the OS easily works with, and read/write various encodings that external software needs (various). So the language can pretend to dodge the issue, but fails in practice.. – Mordachai Nov 10 '11 at 17:46
  • See [this question](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) for some good discussion of these issues. – Nemo Nov 10 '11 at 17:49
  • The C language was invented at a time when it was nearly impossible to generate or display a character over 7 bits, so it's not surprising that the legacy functions don't work so well in the modern world. Unicode wasn't invented until over 15 years later. The answer is to use a library that was written with these issues in mind. – Mark Ransom Nov 10 '11 at 18:46
  • You'd have done much better to have just asked your not unreasonable question rather than preamble it with ill-informed and argumentative preamble just to invite comment that does not really get you closer to an answer. – Clifford Nov 10 '11 at 19:53
  • @Clifford: If all I wanted was a simple answer, then yes. But I am pleased that this led to so much more than a one-off answer. Could I benefit from being less argumentative? I'm sure. Maybe I'll be that mature someday ;) – Mordachai Nov 10 '11 at 20:02
  • Ok folks...time to move this to a chat room. Comments are not intended for extended discussion. Thanks – Kev Nov 10 '11 at 22:43
  • @Mordachai: Argument is more powerful if it is accurate! One of the criteria for closing a question on SO is that it "*will likely solicit opinion, debate, arguments, polling, or extended discussion.*" This question qualifies when perhaps it needn't. – Clifford Nov 12 '11 at 09:47

8 Answers

10

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t, which lets them handle UTF-32 correctly, but that's an implementation detail, not guaranteed by C++, and it is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).

  • IBM ICU
  • Libiconv
  • microUTF-8
  • UTF-8 CPP, version 1.0
  • utfproc
  • and many more at http://unicode.org/resources/libraries.html.

If the question is less about specific character testing and more about code practices in general: Do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding for Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
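For the specific isspace-style test, here is a rough sketch using ICU (listed above): it walks a UTF-8 string code point by code point with ICU's U8_NEXT macro and classifies each one with u_isspace. Treat it as an approximation rather than a drop-in recipe.

#include <unicode/uchar.h>   // u_isspace  (ICU)
#include <unicode/utf8.h>    // U8_NEXT    (ICU)
#include <cstdint>
#include <iostream>
#include <string>

// Walk a UTF-8 string one code point at a time and report the whitespace ones.
// Link against ICU's common library (e.g. -licuuc).
void report_spaces(const std::string& utf8)
{
    const uint8_t* s = reinterpret_cast<const uint8_t*>(utf8.data());
    int32_t i = 0;
    int32_t length = static_cast<int32_t>(utf8.size());
    while (i < length) {
        UChar32 cp;
        U8_NEXT(s, i, length, cp);   // decodes one code point and advances i
        if (cp < 0)
            break;                   // malformed UTF-8 sequence
        if (u_isspace(cp))
            std::cout << "whitespace code point U+" << std::hex << cp << "\n";
    }
}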

Mooing Duck
  • That's wrong. Standard C++ *does* support Unicode, through literals and through the standard library. Also, `char` is defined in such a way to accommodate for that. – wilhelmtell Nov 10 '11 at 17:28
  • One of the biggest issues for us with UTF-32 would be that the OS APIs (Win32 in our case) don't handle those. So we'd have to constantly convert output & input to the OS between UTF-16 (variable width) and UTF-32. [I also expect this is one of those setting oneself up for failure: eventually Unicode will need more than 32 bits, and everyone's code will be broken - so why not just variable-width chars right once & for all and stop dickering around with it 1/2-assedly by all of the various languages?] – Mordachai Nov 10 '11 at 17:36
  • @wilhelmtell: It supports literals, chars of various widths and containers for said chars, but it's not nearly good enough for many purposes. – Mooing Duck Nov 10 '11 at 17:39
  • @Mordachai: Because there is _no_ existing standard that can handle an infinite number of characters (That I've heard of), and the people in charge of standards have agreed that we will never need more than 0x10FFFF – Mooing Duck Nov 10 '11 at 17:41
  • @MooingDuck: UTF-8 (unlike UTF-16) can be extended indefinitely. It's just artificially limited. – Yakov Galka Nov 10 '11 at 17:51
  • @ybungalobill: the UTF-8 encoding scheme is limited to 31 bits encoded in 6 octets if you adhere to the following restrictions: (1) 0xFE and 0xFF are invalid (2) the sequence length can be determined from the first octet – Christoph Nov 10 '11 at 18:03
  • @MooingDuck: the hubris in "the powers that be have decided they'll never need more than..." is intense. It's guaranteed to be wrong. Just like 7 bits was more than enough... until it wasn't. The same is true for any fixed size: and only getting bigger makes the inefficiencies for the common cases egregious. ;) – Mordachai Nov 10 '11 at 18:27
  • @Mordachai: Agreed, but completely off-topic. When humanity gets that far we'll replace the current unicode formats with something completely unrelated, so there's no way to plan for it now. All you can do is work with what already exists. Which we've already answered for you. – Mooing Duck Nov 10 '11 at 18:31
  • Actually, I just calculated that with all of the languages and random symbols ever used (including klingon), the Unicode consortium has only allocated about 9.78% of the code points. Since that's from ~2011 years of writing samples, we can extrapolate that the current unicode encoding should hold us over for approximately 18543 more years. – Mooing Duck Nov 10 '11 at 18:36
  • We're getting off-topic (my apologies) - but information grows in an exponential way. If you used that argument to predict the amount of RAM we'd ever need based on the IBM-PC, then we'd never have gone beyond 16 bits. The same thing was thought when 32 bit IP was designed: never need more - massive range - way beyond anyone's wildest expectations.... but we're bumping against 32 bit limits. It's just the nature of information to expand exponentially. – Mordachai Nov 10 '11 at 18:44
  • @Mordachai: (a) roughly what do you believe the doubling time in years to be for the number of characters in alphabets that it's practically worth supporting in Unicode? (b) when do you anticipate IPv6 addresses running out, and where is your equivalent question asking how to write TCP stacks to deal with the fact that network addresses cannot reasonably be fixed-width? ;-) – Steve Jessop Nov 10 '11 at 18:50
  • @Mordachai: I don't know if language falls into that category. I'd think due to internationalization, widespread literacy, and immediate communication, I'd expect languages to change _slower_ than in the past. I admit it is possible that I am quite mistaken on that. – Mooing Duck Nov 10 '11 at 18:50
  • @Christoph: we don't need both requirements, they can be dropped if we need to extend UTF-8 (alright, we won't really need). – Yakov Galka Nov 10 '11 at 19:12
  • Ok folks...time to move this to a chat room. Comments are not intended for extended discussion. Thanks – Kev Nov 10 '11 at 22:43
7

I think you are confounding a whole host of unrelated concepts.

First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. Its name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.

The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.

The problem is that there are more code points than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: the native host interfaces (command-line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).

Unfortunately, there is nothing specific about "the system's character set" and "the system's multibyte encoding", so you, like so many SO users before you, are left puzzling over what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 2^21 at the moment). Along with the text encoding comes a family of byte-string encodings: UTF-8, UTF-16 and UTF-32.

The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).

Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.

One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
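A rough sketch of that pipeline follows; the encoding names "WCHAR_T" and "UTF-32LE" are glibc-specific, error handling is minimal, and std::setlocale(LC_ALL, "") is assumed to have been called so that mbstowcs knows the system's multibyte encoding.

#include <iconv.h>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// System multibyte string -> raw Unicode code points (sketch, glibc-style iconv).
std::u32string to_codepoints(const char* native)
{
    // Step 1: system multibyte encoding -> wide string
    std::size_t n = std::mbstowcs(nullptr, native, 0);
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte input");
    std::wstring wide(n, L'\0');
    std::mbstowcs(&wide[0], native, n);

    // Step 2: wchar_t -> UTF-32 code points via iconv
    iconv_t cd = iconv_open("UTF-32LE", "WCHAR_T");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::vector<char> out(4 * (wide.size() + 1));
    char* inbuf = reinterpret_cast<char*>(&wide[0]);
    std::size_t inleft = wide.size() * sizeof(wchar_t);
    char* outbuf = out.data();
    std::size_t outleft = out.size();
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);

    std::size_t produced = (out.size() - outleft) / sizeof(char32_t);
    return std::u32string(reinterpret_cast<const char32_t*>(out.data()), produced);
}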

At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.

Kerrek SB
  • "the realm of text processing specialists" - true. And somewhat depressing, that the CS101 standard, "reverse a string" is beyond the knowledge of a typical professional programmer... – Steve Jessop Nov 10 '11 at 17:56
  • @SteveJessop: I think that's just a testament to the richness of human writing, and thus the human mind. It's *very* hard to capture that digitally! But we've only been at it for just over a decade, so I think we're not doing too badly. Gutenberg would be proud! – Kerrek SB Nov 10 '11 at 17:57
  • ish. Unicode sort of set out to provide a common encoding that everyone can use. But the actual result is something that very few people can use *correctly*. I don't think that makes it a failure, as you say it's more a case of "you're gonna need a bigger boat". Part of it is just that Unicode is poorly understood (including by me, I'm not claiming I can do this either!), part of it is that people think things "should be easy" that aren't. Actually I think bandying around words like "presentational transformation" is an excellent way to make the hard things sound like they're hard. – Steve Jessop Nov 10 '11 at 18:01
  • The problem I have with this answer (which is correct as far as it goes) is that it doesn't lend itself efficiently to a practical day's programming for Windows. The OS expects everything to be in UTF-16, but the I/O we require has other needs, and the std C++ library doesn't really handle UTF-16 correctly (e.g. isspace). So I am left wondering: what's a practical approach that I can implement now (which is especially difficult given that I have a massive code base of mixed 7-bit, 8-bit, and 16-bit code that talks to APIs written at various stages of C, and later C++, standards). – Mordachai Nov 10 '11 at 18:34
  • @Mordachai: You simply write clean interfaces and maintain a strong coding discipline. You pick one form to maintain internally, and you only deal with explicit encodings in abstracted, possibly platform-specific leaf parts of your code. – Kerrek SB Nov 10 '11 at 18:45
  • @KerrekSB: That pov makes sense to me when the I/O is relatively well-defined and externalized (clear borders). And File I/O tends to fit that very well. But for a Win32 Desktop App, I find that trying to store things in a format other than what Windows wants is excruciatingly cumbersome. It infects the resources (thousands of strings) that we display to the user, GUI interaction, etc. So choosing an internal representation other than UTF-16 or MS's locale-based MBCS is not very practical for a majority of the code. – Mordachai Nov 10 '11 at 19:10
  • @KerrekSB: Which means that a good practical choice is for us to choose UTF-16 (or current-locale MBCS). But it still leaves me with "how do I determine if the following storage-character is printable or not" - as a purely necessary, practical need in GUI interactions. So I'm down to: get a library to use that handles conversions + basic replacement functions for isspace, isprint, etc. – Mordachai Nov 10 '11 at 19:12
  • @Mordachai: How about keeping everything internally as `wchar_t*` or `std::wstring`? Then you can use it directly in Win32 (since Windows actually fixes wide strings to be UTF-16 encoded), and you can still use `std::isspace(str, std::locale(""));`. – Kerrek SB Nov 10 '11 at 19:21
  • +1 for calling out "the single worst decision in the design of the C programming language". – dan04 Nov 11 '11 at 09:12
6

It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.
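To make that concrete, a trivial sketch:

#include <cctype>

bool is_space_wrong(const char* pchar)
{
    return std::isspace(*pchar) != 0;   // undefined behavior if *pchar is negative and not EOF
}

bool is_space_right(const char* pchar)
{
    return std::isspace((unsigned char)*pchar) != 0;   // zero-extend first, as described above
}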

fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.

That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.

It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.

In general, there are three correct ways to handle Unicode text:

  1. At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and such like, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.

  2. Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.

  3. (Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.

Steve Jessop
  • I think most linux programs use UTF-8 for everything, since most linux libraries take UTF-8, and most programs don't have to do much with it. – Mooing Duck Nov 10 '11 at 18:00
  • @MooingDuck: right, because most programs are only interested in strings of code points, not on anything highly complex. "Words", for example. If someone is wondering how to use `isspace` correctly, and also wondering about Unicode, then they're into territory where UTF-8 doesn't *easily* go. Linux has the fallback that `wchar_t` can represent a Unicode code point, which is at least a start when UTF-8 won't do. – Steve Jessop Nov 10 '11 at 18:05
3

One comment up front: the old C functions like isspace took int for a reason: they support EOF as input as well, so they need to be able to support one more value than will fit in a char. The “naïve” decision was allowing char to be signed—but making it unsigned would have had severe performance implications on a PDP-11.

Now to your questions:

1) Sign expansion

The C++ functions don't have this problem. In C++, the “correct” way of testing things like whether a character is a space is to grab the std::ctype facet from whatever locale you want, and to use it. Of course, the C++ localization, in <locale>, has been carefully designed to make it as hard as possible to use, but if you're doing any significant text processing, you'll soon come up with your own convenience wrappers: a functional object which takes a locale and mask specifying which characteristic you want to test isn't hard. Making it a template on the mask, and giving its locale argument a default to the global locale, isn't rocket science either. Throw in a few typedefs, and you can pass things like IsSpace() to std::find_if. The only subtlety is managing the lifetime of the std::ctype object you're dealing with. Something like the following should work, however:

template<std::ctype_base::mask mask>
class Is  //  Must find a better name.
{
    std::locale myLocale;
            //< Needed to ensure no premature destruction of facet
    std::ctype<char> const* myCType;
public:
    Is( std::locale const& l = std::locale() )
        : myLocale( l )
        , myCType( &std::use_facet<std::ctype<char> >( l ) )
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};

typedef Is<std::ctype_base::space> IsSpace;
//  ...

(Given the influence of the STL, it's somewhat surprising that the standard didn't define something like the above as standard.)
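A hypothetical usage sketch, assuming the Is<> template and the IsSpace typedef above are in scope:

#include <algorithm>
#include <iostream>
#include <string>

int main()
{
    std::string line = "hello world";
    // Find the first character classified as a space in the global locale.
    std::string::iterator it = std::find_if(line.begin(), line.end(), IsSpace());
    if (it != line.end())
        std::cout << "first space at index " << (it - line.begin()) << '\n';
}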

2) Variable width character issues.

There is no real answer. It all depends on what you need. For some applications, just looking for a few specific single byte characters is sufficient, and keeping everything in UTF-8, and ignoring the multi-byte issues, is a viable (and simple) solution. Beyond that, it's often useful to convert to UTF-32 (or depending on the type of text you're dealing with, UTF-16), and use each element as a single code point. For full text handling, on the other hand, you have to deal with multi-code-point characters even if you're using UTF-32: the sequence \u006D\u0302 is a single character (a small m with a circumflex over it).
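For instance, even as UTF-32 that "single character" occupies two code points:

#include <cassert>
#include <string>

int main()
{
    std::u32string s = U"\u006D\u0302";  // 'm' followed by a combining circumflex
    assert(s.size() == 2);               // two code points, one user-perceived character
}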

James Kanze
0

I haven't tested the internationalization capabilities of the Qt library very much, but from what I know, QString is fully Unicode-aware and uses QChars, which are Unicode characters. I don't know their internal implementation, but I expect this implies QChars are variable-size characters.

It would be weird to bind yourself to such a big framework as Qt just to use strings, though.

j_kubik
  • Yeah, it would, especially since we've got code that uses C-library, C++ std:: library, MFC CStrings, and Win32 APIs already! Yeesh - I need a single, genuinely correct & robust string. :) – Mordachai Nov 10 '11 at 17:29
  • QString is easily convertible from and to std::string and std::wstring using localization codecs. Those in turn convert easily to C strings that work well with the Win32 API. The only ones that I don't know much about are MFC strings, but I am sure that conversion is possible. Anyway, why so many different formats? Are you using different libraries/code-pieces in one project? – j_kubik Nov 10 '11 at 18:10
0

You seem to be confusing a function defined on 7-bit ASCII with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign-extension, because the numbers these functions are defined on have no 8th bit. Passing a byte that might have that bit set is a mistake on your part.

Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.

The common notion in C, even still, is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter (/symbol/codepoint), so one can finally define a function isspacerune(), which tells you which letters are spaces.

Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
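A sketch of that style of loop, assuming the plan9port/libutf utf.h header (Rune, chartorune and isspacerune come from that library; details may differ between libutf variants):

#include <utf.h>     // Rune, chartorune, isspacerune (plan9port / libutf)

// Count the whitespace letters in a UTF-8 encoded, NUL-terminated buffer.
int count_spaces(char* s)
{
    int count = 0;
    while (*s) {
        Rune r;
        s += chartorune(&r, s);   // decode one rune, advance by its byte length
        if (isspacerune(r))
            ++count;
    }
    return count;
}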

Dave
  • Given the existence of combining marks, for a `Rune` to represent a letter it has to be capable of holding a sequence of code points. – Steve Jessop Nov 11 '11 at 12:18
0

Your preamble argument is somewhat inaccurate and arguably unfair; it is simply not in the library's design to support Unicode encodings - certainly not multiple Unicode encodings.

The development of the C and C++ languages and much of the libraries pre-dates the development of Unicode. Also, as systems-level languages, they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately, the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. History has perhaps shown this to be flawed, but changing the language definition, and indeed the library, would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte and a distinct char type.

Moreover, the variable-width encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support, and as you point out, this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms, from the deeply embedded to the supercomputer. All such things must be provided by either system or third-party libraries.

Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.

"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign expansion, and

2) variable-width character issues

isspace() considers only the lower 8 bits. Its definition explicitly states that if you pass an argument that is not representable as an unsigned char or equal to the value of the macro EOF, the results are undefined. The problem does not arise if it is used as it was intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.

After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS

isspace() is not defined for Unicode. You'll need a library designed to use any specific encoding you are using. This question What is the best Unicode library for C? may be relevant.

Clifford
  • -1 for demonstrating ignorance of UTF-8. The OP did actually make the proper distinction between a `char` being 8 bits and a "character" being variable width. **A C(++) `char` is not a character!** – dan04 Nov 11 '11 at 09:22
  • @dan04: I did not claim any knowledge of UTF8, I deliberately steered clear of the subject because I knew I'd be on shaky ground; not much call for it in the embedded systems I develop. However you are right, but up to that point he had not even mentioned Unicode, and appeared to be using the terms interchangeably. In context I think it was ambiguous. The point about a `char` not being a character (but a small integer) is one that should be addressed to Mordachai; he is the one who appears to be attempting to use it that way - or at least railing at the fact that it does not work. – Clifford Nov 11 '11 at 19:44
  • @dan04: I have removed the apparently offending paragraph. The fact that that whole paragraph was entirely unclear is for a comment not an answer. – Clifford Nov 11 '11 at 19:52
  • Further moderated so as not to appear to rise to Mordachai's somewhat argumentative bait, and be more constructive. – Clifford Nov 12 '11 at 09:36
0

The sign extension issue is easy to deal with. You can either use:

  • isspace((unsigned char) ch)
  • isspace(ch & 0xFF)
  • the compiler option that makes char an unsigned type

As far as the variable-length character issue goes (I'm assuming UTF-8), it depends on your needs.

If you just need to deal with the ASCII whitespace characters \t\n\v\f\r, then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.

But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of

bool isspace_utf8(const char* pChar)
{
    // decode_char is a placeholder for a routine that decodes the UTF-8
    // sequence starting at pChar into a single Unicode code point.
    uint32_t codePoint = decode_char(pChar);
    return is_unicode_space(codePoint);
}

Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
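For illustration, a hand-rolled is_unicode_space covering the characters listed above might look like the following; hard-coded tables like this go stale as Unicode evolves, which is one more argument for a dedicated library:

#include <algorithm>
#include <cstdint>
#include <iterator>

// True for the whitespace code points enumerated above:
// the ASCII/Cc spaces plus the category-Z characters.
bool is_unicode_space(uint32_t cp)
{
    static const uint32_t spaces[] = {
        0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
        0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
        0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x2028, 0x2029, 0x202F,
        0x205F, 0x3000
    };
    return std::binary_search(std::begin(spaces), std::end(spaces), cp);
}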

most strings in practice use a multibyte encoding such as UTF-7, UTF-8, UTF-16, SHIFT-JIS, etc.

No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with UTF-8, -16, or -32, and only convert as needed.

dan04
  • I appreciate the many thoughtful responses. It has helped me expand my thinking about the issues. I did want to let you know that many programs are written using the current locale's multibyte code page - which to my knowledge includes Shift-JIS (or something very close to it). Our main software in fact is compiled for MBCS, so working with variable char lengths is the norm for us. As it would be if we switched to UTF-16 (native Windows), because that is also a variable width encoding. Which is why it is hard to justify the pain of converting from our current narrow char to wide char... – Mordachai Nov 11 '11 at 14:18