
I have noticed that the character class [:blank:] in Boost.Regex also matches \v (vertical tab), as demonstrated by the code below. Per POSIX, however, it should match only space and horizontal tab, shouldn't it?

#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;
int main() {
    std::string const text{"\v"};
    cout << (sregex_token_iterator{text.begin(), text.end(), regex{R"((?-m)^([[:blank:]])$)"}} != sregex_token_iterator{});
    cout << (sregex_token_iterator{text.begin(), text.end(), regex{R"((?-m)^([ \t])$)"}} != sregex_token_iterator{}) << '\n';
    // output: 10, but I expected 00
    return 0;
}

Since this page of the Boost documentation doesn't mention all of the character classes that I see listed here, I suspect that Boost regexes are not POSIX-compliant, even though they use some of those named character classes. The word POSIX doesn't even appear on that Boost page, so I guess I'm almost answering my own question, but I don't feel confident enough.

I haven't checked which of these characters fall into [:blank:] and/or [:space:], but I guess there might be some other surprises here too:

const auto LF   = "\x0A";
const auto VT   = "\x0B";
const auto FF   = "\x0C";
const auto CR   = "\x0D";
const auto CRLF = "\x0D\x0A";
const auto NEL  = "\xC2\x85";
const auto LS   = "\xE2\x80\xA8";
const auto PS   = "\xE2\x80\xA9";
Enlico
  • _However, that shouldn't be there, per POSIX, should it?_ : No, it should not be. [not related] : There is no such bug in the standard regex library. [Example](https://wandbox.org/permlink/ojXnFY4JZ6jS2Uet). Any specific reason to stick to Boost headers? – brc-dd Sep 21 '20 at 19:54
  • @brc-dd, actually there is one such reason, [here](https://stackoverflow.com/q/64007157/5825294). – Enlico Sep 22 '20 at 10:04
  • Yeah, in Boost they have extended support for unicode strings. In case of `std::regex` one can only use `std::string` and `std::wstring` directly. Surely, standard library will not work if you are dealing with unicode string. But for ASCII ones it's my recommended option. I am expecting that some support will come with C++23. – brc-dd Sep 22 '20 at 11:14

1 Answer


Update:
Here is some information on how to control the way the Boost.Regex engine works.

The engine's behavior can be changed by setting different syntax option flags.

See: http://boost.sourceforge.net/libs/regex/doc/syntax_option_type.html

Synopsis excerpt:

Type syntax_option_type is an implementation-specific bitmask type that controls how a regular expression string is to be interpreted. For convenience, note that all the constants listed here are also duplicated within the scope of class template basic_regex.

namespace std{ namespace regex_constants{

typedef implementation-specific-bitmask-type syntax_option_type;

// these flags are standardized:
static const syntax_option_type normal;
static const syntax_option_type ECMAScript = normal;
static const syntax_option_type JavaScript = normal;
static const syntax_option_type JScript = normal;
static const syntax_option_type perl = normal;
static const syntax_option_type basic;
static const syntax_option_type sed = basic;
static const syntax_option_type extended;
static const syntax_option_type awk;
static const syntax_option_type grep;
static const syntax_option_type egrep;
static const syntax_option_type icase;
static const syntax_option_type nosubs;
static const syntax_option_type optimize;
static const syntax_option_type collate;
// other boost.regex specific options are listed below

} // namespace regex_constants
} // namespace std

It appears that the syntax option should also change the matching behavior of the engine.
For POSIX extended behavior specifically, the syntax option is extended.

See this section for the POSIX extended option information:

http://boost.sourceforge.net/libs/regex/doc/syntax_option_type.html#extended

_____________________

I don't know whether this will change what [[:blank:]] matches.
I'm not in a position to build a C++ test program against the Boost libraries
at this point.

If anyone tries it, let me know what you find for that class. Thanks.


Original:
This is based only on my own tests. With my setup as of this date
I can only use the Perl option.

It looks like [[:blank:]] matches 18 Unicode (UTF-8) code points:

00 0009    <control-0009>
00 0020    SPACE
00 00A0    NO-BREAK SPACE
00 1680    OGHAM SPACE MARK
00 2000    EN QUAD
00 2001    EM QUAD
00 2002    EN SPACE
00 2003    EM SPACE
00 2004    THREE-PER-EM SPACE
00 2005    FOUR-PER-EM SPACE
00 2006    SIX-PER-EM SPACE
00 2007    FIGURE SPACE
00 2008    PUNCTUATION SPACE
00 2009    THIN SPACE
00 200A    HAIR SPACE
00 202F    NARROW NO-BREAK SPACE
00 205F    MEDIUM MATHEMATICAL SPACE
00 3000    IDEOGRAPHIC SPACE

And 4 (UTF-16) code points:

00 0009    <control-0009>
00 0020    SPACE
00 00A0    NO-BREAK SPACE
00 3000    IDEOGRAPHIC SPACE
  • Which seems to reinforce that Boost's `[:blank:]` is bugged, as it should match only space and (horizontal) tab. – Enlico Sep 21 '20 at 20:29
  • Oh absolutely. On the other hand, there are flags you can set for POSIX that may give you better Boost results. –  Sep 21 '20 at 20:31
  • Can you link something off your answer? – Enlico Sep 21 '20 at 20:31
  • I think you have to set the extended flag, not sure. See http://boost.sourceforge.net/libs/regex/doc/syntax_option_type.html#extended. Over my head really, the filter I used above was for perl and not POSIX. I probably should delete it. –  Sep 21 '20 at 20:35
  • So like `static const syntax_option_type extended;` use _extended_ rx("", extended); Just guessing –  Sep 21 '20 at 20:41
  • Including the test file that you allude to would probably be good. – Enlico Oct 03 '20 at 09:29
  • OK, I did, but it's unknown whether this has an effect. –  Oct 03 '20 at 16:37