In POSIX-like regex engines, punct stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems).
According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale.
So relying on the punct class there will not lead to portable code; see the ICU user guide on C/POSIX Migration for more details.
On the other hand, the ICU library, on which stringi relies and which fully conforms to the Unicode standard, defines some of the character classes in its own (but well-defined and always portable) way.
In particular, according to the Unicode standard, the PLUS SIGN (U+002B) belongs to the Symbol, Math (Sm) category, and not to any Punctuation (P) category.
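The category claim can be checked directly with Unicode property escapes. This is only a small sketch: the variable names are made up for illustration, and the base R call at the end is included for contrast under the assumption that the session runs in an ordinary C/POSIX-style locale, where its result may differ from ICU's.

```r
library("stringi")

plus <- "+"

# ICU: U+002B has general category Sm (Symbol, Math), not P (Punctuation)
is_math_symbol <- stri_detect_regex(plus, "^\\p{Sm}$")
is_punctuation <- stri_detect_regex(plus, "^\\p{P}$")

# ICU's [[:punct:]] follows the Unicode P category, so it skips "+"
is_icu_punct <- stri_detect_regex(plus, "^[[:punct:]]$")

# Base R uses a different regex engine; its answer depends on that
# engine and the current locale (assumption: usually TRUE for "+")
base_punct <- grepl("[[:punct:]]", plus)

print(c(Sm = is_math_symbol, P = is_punctuation, icu_punct = is_icu_punct))
print(base_punct)
```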
library("stringi")
ascii <- stri_enc_fromutf32(1:127)
stri_extract_all_regex(ascii, "[[:punct:]]")[[1]]
## [1] "!" "\"" "#" "%" "&" "'" "(" ")" "*" "," "-" "." "/" ":" ";" "?" "@" "[" "\\" "]" "_" "{" "}"
stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]]
## [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"
So here you should rather use character sets such as [[:punct:][:symbol:]] or [[:punct:]+], or, even better, [\\p{P}\\p{S}] or [\\p{P}+].
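As a quick illustration of how these sets differ, here is a minimal sketch on a made-up test string (the string x and the helper extract() are hypothetical, chosen only for this example):

```r
library("stringi")

x <- "a+b, c=d; e!"

# Hypothetical helper: all matches of a pattern in a single string
extract <- function(pattern) stri_extract_all_regex(x, pattern)[[1]]

only_punct    <- extract("[[:punct:]]")           # Unicode punctuation only
punct_symbol  <- extract("[[:punct:][:symbol:]]") # punctuation and symbols
punct_plus    <- extract("[[:punct:]+]")          # punctuation and a literal +
unicode_p_s   <- extract("[\\p{P}\\p{S}]")        # same idea, property syntax
unicode_p_pls <- extract("[\\p{P}+]")             # \p{P} and a literal +

print(only_punct)    # "," ";" "!"
print(punct_symbol)  # "+" "," "=" ";" "!"
print(unicode_p_pls) # "+" "," ";" "!"
```

Note that adding only + to the set leaves out other symbols such as =, so the choice between [\\p{P}\\p{S}] and [\\p{P}+] depends on whether all symbols or just the plus sign should be matched.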
For details on available character classes, check out ?"stringi-search-charclass". In particular, the ICU User Guide on UnicodeSet and Unicode Standard Annex #44: Unicode Character Database may be of interest to you. HTH