R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

Question

I'm trying to remove non-alphabet characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this belong to another group of characters?

library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')

It shouldn't, at least according to http://www.regular-expressions.info/posixbrackets.html — davide, Oct 13 '14 at 20:52
Also here https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html — davide, Oct 13 '14 at 20:54
@davide actually, your second link lists '+' under the `[:punct:]` characters and `grepl('[[:punct:]]', '+')` returns `TRUE`. So in base R regex, at least, '+' is considered a punctuation character. — Matthew Plourde, Oct 13 '14 at 20:55
R regex needs an extra set of "[]" for that argument to a character class to succeed. See `?regex` — IRTFM, Oct 13 '14 at 20:56
@MatthewPlourde Yes, by mistake I wrote "shouldn't" instead of "should" in my first comment... — davide, Oct 13 '14 at 20:58
@BondedDust Obviously by "that" I mean your suggestion to add an extra pair of brackets. His first call does in fact do something, it replaces the period from the second to last string with a space. — Matthew Plourde, Oct 13 '14 at 21:02
Oh. Another non-standard evaluation by a supposedly "helpful" wrapper just ends up confusing us. — IRTFM, Oct 13 '14 at 21:04
@BondedDust: stringi offers a complete rewrite of all string processing functions. Thanks to its use of ICU, it fully conforms to the Unicode standard (POSIX is not the same as Unicode and does not work the same on all platforms); see [my answer below](http://stackoverflow.com/a/26357004/3309529) for more details. — gagolews, Oct 14 '14 at 09:27
I suggest you put something like that in the documentation. I did go to several help pages in the pkg:stringi, and found the Unicode character classes, but if there were worked examples that illustrated approaches to building more complex character class expressions, then I missed them. — IRTFM, Oct 14 '14 at 15:13
@BondedDust: for the current devel version: http://docs.rexamine.com/R-man/stringi/stringi-search-charclass.html — gagolews, Oct 26 '14 at 11:27

hwnd · Answer 1 · 2014-10-15T02:53:57.843

POSIX character classes need to be wrapped inside a character class (a bracket expression); the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.

In the ASCII range, this POSIX named class matches all non-control, non-alphanumeric, non-space characters.

ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
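
To see why the outer brackets matter, here is a minimal base R sketch. The input string x is made up for illustration, and the behavior shown assumes R's default regex engine (TRE, i.e. perl = FALSE), where an unwrapped [:punct:] is read as a set of the literal characters ":", "p", "u", "n", "c" and "t".

```r
x <- "punctuation: yes + no!"  # illustrative input, not from the question

# Without the outer brackets this is an ordinary bracket expression
# containing the literal characters : p u n c t
no_outer <- gsub("[:punct:]", "", x)
no_outer
# [1] "aio yes + o!"

# With the outer brackets the POSIX class punct is used
with_outer <- gsub("[[:punct:]]", "", x)
with_outer
# [1] "punctuation yes  no"
```

Note that perl = TRUE is stricter here: PCRE is expected to reject the unwrapped form rather than silently treat it as a set of letters.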

However, the locale in effect can alter the behavior of [[:punct:]] ...

The R documentation for ?regex states the following: "Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation is that of the POSIX locale."

The Open Group LC_CTYPE definition for punct says:

Define characters to be classified as punctuation characters.

In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.

However, the stringi package seems to depend on ICU and locale is a fundamental concept in ICU.

Using the stringi package, I recommend using the Unicode Properties \p{P} and \p{S}.

\p{P} matches any kind of punctuation character. However, it is missing nine of the characters that the POSIX class punct includes, because Unicode splits what POSIX considers to be punctuation into two categories: Punctuation and Symbols. This is where \p{S} comes into play ...
```
stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "
```
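
The nine-character gap can also be checked from base R. This is only a sketch: it assumes R's PCRE build has Unicode property support (needed for \p{P} and \p{S} with perl = TRUE) and a locale in which [[:punct:]] covers the usual ASCII punctuation. The variable names are illustrative.

```r
# Printable, non-space ASCII characters
ascii <- rawToChar(as.raw(33:126), multiple = TRUE)

# What the POSIX class matches (default engine) vs. Unicode Punctuation (PCRE)
posix_punct   <- ascii[grepl("[[:punct:]]", ascii)]
unicode_punct <- ascii[grepl("\\p{P}", ascii, perl = TRUE)]

# Characters POSIX calls punctuation but Unicode does not
missing_from_p <- setdiff(posix_punct, unicode_punct)
missing_from_p
length(missing_from_p)  # expected to be 9 under the assumptions above

# Each of them should fall into the Unicode Symbol category instead
all_symbols <- all(grepl("\\p{S}", missing_from_p, perl = TRUE))
all_symbols
```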

Or fall back to gsub from base R, which handles this very well.

gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "

Another example of why I have never seen fit to use stringi or stringr. The ordinary R regex is already very clean and "regular". Wrapping it just adds to the capacity for error. — IRTFM, Oct 13 '14 at 21:02
@BondedDust, Actually stringi's main advantage is speed. It isn't a wrapper, rather completely rewritten. Unlike stringr which is basically a wrapper as far as I can tell — David Arenburg, Oct 13 '14 at 22:11
Right. I had formed an incorrect impression. Appears it also offers vectorization of the pattern and replacement arguments, but it's not much use to me without better documentation. — IRTFM, Oct 13 '14 at 22:13
I'm using it for speed reasons, this is for a file with 50MM rows. When things work, stringi is ~100x faster than stringr. — screechOwl, Oct 13 '14 at 23:09

gagolews · Answer 2 · 2014-10-14T09:24:03.203

In POSIX-like regex engines, punct stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale. So the punct class here will not lead to portable code; see the ICU user guide on C/POSIX Migration for more details.

On the other hand, the ICU library, on which stringi relies, and which fully conforms to the Unicode standard, defines some of the charclasses in its own -- but well-defined and always portable -- way.

In particular, according to the Unicode standard, the PLUS SIGN (U+002B) belongs to the Symbol, Math (Sm) category (and is not a Punctuation Mark (P)).

library("stringi")
ascii <- stri_enc_fromutf32(1:127)
stri_extract_all_regex(ascii, "[[:punct:]]")[[1]]
##  [1] "!"  "\"" "#"  "%"  "&"  "'"  "("  ")"  "*"  ","  "-"  "."  "/"  ":"  ";"  "?"  "@"  "["  "\\" "]"  "_"  "{"  "}" 
stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]]
## [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"

So here you should rather use such character sets as [[:punct:][:symbol:]], [[:punct:]+], or even better [\\p{P}\\p{S}] or [\\p{P}+].
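
As a quick check of these suggestions, the sketch below applies all four character sets to the last string from the question and confirms the category of the plus sign. It assumes stringi is installed; the names x, patterns, is_match and cleaned are illustrative.

```r
library(stringi)

x <- "this is the final + test!"  # last string from the question

# '+' is a math symbol (Sm), not punctuation (P), under Unicode rules
is_match <- stri_detect_regex("+", c("\\p{P}", "\\p{Sm}"))
is_match
# [1] FALSE  TRUE

# The four character sets suggested above, applied in one vectorized call
patterns <- c("[[:punct:][:symbol:]]", "[[:punct:]+]",
              "[\\p{P}\\p{S}]", "[\\p{P}+]")
cleaned <- stri_replace_all_regex(x, patterns, " ")
unique(cleaned)
# [1] "this is the final   test "
```

All four give the same result on this input. The sets ending in a literal + only add the plus sign, so they would leave other symbols such as $ or = untouched.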

For details on available character classes, check out ?"stringi-search-charclass". In particular, the ICU User Guide on UnicodeSet and Unicode Standard Annex #44: Unicode Character Database may be of interest. HTH
