
I find this really odd:

pattern <- "[[:punct:][:digit:][:space:]]+"
string  <- "a . , > 1 b"

gsub(pattern, " ", string)
# [1] "a b"

library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"

str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"

Is this expected?

moodymudskipper

1 Answer


Still working on this, but ?"stringi-search-charclass" says:

Beware of using POSIX character classes, e.g. ‘[:punct:]’. ICU User Guide (see below) states that in general they are not well-defined, so may end up with something different than you expect.

In particular, in POSIX-like regex engines, ‘[:punct:]’ stands for the character class corresponding to the ‘ispunct()’ classification function (check out ‘man 3 ispunct’ on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ‘ispunct()’ function tests for any printing character except for space or a character for which ‘isalnum()’ is true. However, in a POSIX setting, the details of what characters belong into which class depend on the current locale. So the ‘[:punct:]’ class does not lead to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in ‘ICU’. You have been warned.

Copying from the issue posted above,

string  <- "a . , > 1 b"
mypunct <- "[[\\p{P}][\\p{S}]]"
# removes ".", "," and ">" but leaves the spaces and the digit
stringr::str_remove_all(string, mypunct)
# [1] "a    1 b"

I can appreciate stuff being locale-specific, but it still surprises me that [:punct:] doesn't even work in a C locale ...
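
As a rough sketch of why `>` slips through: in Unicode terms it is a math symbol (general category `S`), not punctuation (`P`), so an ICU pattern has to name both properties to catch it. The combined pattern below is the one suggested in the comments on this answer; the variable names `icu_pattern` and `result` are only for illustration.

```r
library(stringr)

string <- "a . , > 1 b"

# ">" belongs to the Unicode symbol category, not the punctuation category
str_detect(">", "\\p{P}")
# [1] FALSE
str_detect(">", "\\p{S}")
# [1] TRUE

# Union of punctuation, symbols, digits and whitespace for the ICU engine
icu_pattern <- "[[\\p{P}][\\p{S}]\\d\\s]+"
result <- str_replace_all(string, icu_pattern, " ")
result
# [1] "a b"
```

This reproduces the `gsub()` result from the question for this particular string, but the two character sets are not guaranteed to agree on every input.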

Ben Bolker
  • 2
    Thanks. To be safe I'm using `str_replace_all(string, "[[\\p{P}][\\p{S}]\\d\\s]+", " ")`. From the stringi help page `stringi-search-regex` I get `\p{UNICODE PROPERTY NAME} : Match any character with the specified Unicode Property.`, and https://en.wikipedia.org/wiki/Unicode_character_property teaches me that `P` is for punctuation and `S` is for symbols. – moodymudskipper Nov 02 '18 at 14:20
  • 1
    `So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in ‘ICU’`: it is important to note that they are still not equivalent. For instance, the `€` and `$` symbols are removed by `mypunct` using `str_replace_all`, but not by `gsub` with `[:punct:]`. – moodymudskipper Nov 02 '18 at 14:33