Can isspace() give false positives with UTF-8 text?

Question

I know isspace() is meant to work for ASCII, but I have UTF-8 text. If isspace() looks only at the lower 7 bits, where UTF-8 and ASCII overlap, it should be safe to use.

By "safe to use" I mean that it won't classify a non-whitespace Unicode character as whitespace. I know there may be special Unicode whitespace characters that it will not detect, but that is not a problem for me.

I.e. I'm OK with false negatives, so long as there are no false positives. Is it correct to assume that?

`isspace` is intended to work for any fixed-width single-byte character set, of which ASCII is the most common but by no means the only example. On a system that uses a character set that isn't ASCII, it will give appropriate answers for that character set, not for ASCII. – Pete Becker, Jun 25 '16 at 12:08
@PeteBecker Are there modern non-ASCII systems? I've heard about EBCDIC but as far as I understand it, those are from very old times when things hadn't been standardized. — sashoalm, Jun 25 '16 at 12:15
Wrong question. If you're making design and coding decisions based on the view that all the world is ASCII, you'd better be able to justify it. Anything else is not engineering. – Pete Becker, Jun 25 '16 at 12:19
@PeteBecker OK, I'll keep that in mind. The code I need this for will only run on Linux (on an x86 PC), so for my purposes at least I'm sure it will be ASCII. – sashoalm, Jun 25 '16 at 12:23
It is a nonsensical argument. UTF-8 means only one thing, and it has squat to do with any legacy 8-bit encoding. It is a Unicode encoding, designed to remove ambiguities. If you don't care at all about typographical accuracy, then don't bother and just use `== ' '`. It will never match part of a UTF-8 sequence, because the 2nd and subsequent bytes have their MSB turned on. – Hans Passant, Jun 25 '16 at 12:43
There are 2 versions of `isspace`: a templated one in `<locale>` (http://en.cppreference.com/w/cpp/locale/isspace) and an untemplated one in `<cctype>` (http://en.cppreference.com/w/cpp/string/byte/isspace). See the example in the `<locale>` version. – Richard Critten, Jun 25 '16 at 12:44
@PeteBecker the example here includes UTF-8: http://en.cppreference.com/w/cpp/locale/isspace — Richard Critten, Jun 25 '16 at 12:47
@HansPassant -- that's a bit strong. The values 0-127 represent the same characters in utf-8 and ASCII. That means that a reader that understands utf-8 can read ASCII text correctly, and that's important and useful. — Pete Becker, Jun 25 '16 at 13:20
@RichardCritten - yes, there is a version of `std::isspace` that can be called with a locale, and there might be a locale on your system that supports utf-8. Nevertheless, the question is clearly about the C version, and that, by default, uses the "C" locale, which uses a default character encoding for the system; that encoding is not required to be ASCII. — Pete Becker, Jun 25 '16 at 15:01
@PeteBecker: there is nothing in the question that states whether the C or C++ version of `isspace()` is being used, or that the `"C"` locale is being used. What you say is *likely* the case, but the OP should clarify the exact function and locale settings being used. – Remy Lebeau, Jun 30 '16 at 21:04

score 2 · Answer 1 · answered Jun 25 '16 at 12:08

It may be safe, as there is absolutely no difference between ASCII and UTF-8 for code points between 0 and 127.
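To illustrate why bytes 0 to 127 never appear inside a multi-byte sequence, here is a minimal sketch of the standard UTF-8 encoding rules. The helper names `encode_utf8` and `all_bytes_high` are hypothetical and exist only for this example; the sketch assumes the input is a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8.
// Assumes cp is valid (<= 0x10FFFF and not a surrogate); no validation is done.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: identical to ASCII, high bit clear.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if every byte in the string has its most significant bit set.
bool all_bytes_high(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b < 0x80) return false;
    }
    return true;
}
```

For example, U+00A0 (no-break space) encodes as the two bytes 0xC2 0xA0, and both are above 127, so a check that only matches byte values below 128 cannot fire on any part of it.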

shiva

There's a big difference between, for example, EBCDIC and utf-8, and `isspace` on a system that uses EBCDIC as its native encoding will not give answers that make any sense for either ASCII or utf-8. – Pete Becker Jun 25 '16 at 12:11
But does isspace() check for anything above 127? That was the part I was not certain about. – sashoalm Jun 25 '16 at 12:21
Check [this](https://www.cs.tut.fi/~jkorpela/chars/spaces.html). It returns `True` for all except `U+FEFF`. – shiva Jun 25 '16 at 12:51

Remy Lebeau · Accepted Answer · 2016-06-30T21:09:32.050

isspace() is subject to locale definitions of whitespace characters at runtime.

In C, whitespace characters are defined by the locale selected in a call to setlocale() with the LC_ALL or LC_CTYPE category.

In C++, whitespace characters are defined by the locale specified by either:

- a call to std::setlocale() with the LC_ALL or LC_CTYPE category, when using the version of std::isspace() from the <cctype> header.
- a locale passed as an input parameter, when using the version of std::isspace() from the <locale> header.
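As a rough sketch of the two C++ variants described above: the wrapper names `is_space_global` and `is_space_in` are hypothetical and exist only for illustration. The cast to `unsigned char` in the first wrapper is there because the `<cctype>` function requires its argument to be representable as `unsigned char` (or to be EOF), and a plain `char` holding a UTF-8 byte may be negative.

```cpp
#include <cctype>
#include <locale>

// <cctype> version: classification follows the global C locale,
// i.e. whatever std::setlocale() last selected ("C" by default).
// Hypothetical wrapper; the cast avoids undefined behavior for negative chars.
bool is_space_global(char ch) {
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}

// <locale> version: classification follows the std::locale object
// passed in by the caller, independent of std::setlocale().
// Hypothetical wrapper around the templated std::isspace overload.
bool is_space_in(char ch, const std::locale& loc) {
    return std::isspace(ch, loc);
}
```

Passing `std::locale::classic()` to the second wrapper asks for the "C" locale explicitly, which may be preferable when other code in the program could have changed the global locale.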

The default locale is the "C" locale. It defines the following whitespace characters, which are the same in ASCII, in UTF-8, and in most ASCII-compatible locales, but may be different in other locales:

' '  (0x20) space (SPC) 
'\t' (0x09) horizontal tab (TAB) 
'\n' (0x0a) newline (LF) 
'\v' (0x0b) vertical tab (VT) 
'\f' (0x0c) form feed (FF) 
'\r' (0x0d) carriage return (CR)
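If the goal is to avoid any dependence on the runtime locale, one option is to test for these six byte values directly. The following is a minimal sketch, not a definitive implementation: `is_ascii_space`, `count_ascii_spaces`, and `high_bytes_flagged_by_isspace` are hypothetical names, and the input is assumed to be UTF-8 (or another ASCII-compatible encoding).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the six whitespace bytes listed above,
// compared by numeric value so the result does not depend on the locale.
bool is_ascii_space(char ch) {
    const unsigned char b = static_cast<unsigned char>(ch);
    return b == 0x20 || (b >= 0x09 && b <= 0x0D);
}

// Counts whitespace bytes in a UTF-8 string. Bytes >= 0x80, which make up
// every multi-byte sequence, can never match.
std::size_t count_ascii_spaces(const std::string& utf8) {
    std::size_t n = 0;
    for (char ch : utf8) {
        if (is_ascii_space(ch)) ++n;
    }
    return n;
}

// Counts how many byte values in 0x80..0xFF the <cctype> isspace() treats as
// whitespace under the current global locale. In the default "C" locale
// this should be 0; in other locales it may not be.
int high_bytes_flagged_by_isspace() {
    int flagged = 0;
    for (int b = 0x80; b <= 0xFF; ++b) {
        if (std::isspace(b)) ++flagged;
    }
    return flagged;
}
```

For instance, in the string "a\xC2\xA0 b\xE3\x80\x80\n" (containing U+00A0 and U+3000), `count_ascii_spaces` counts only the plain space and the newline. The Unicode spaces are missed, which is the kind of false negative the question accepts.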
