According to the R regex documentation, [:punct:] includes the following characters:

Punctuation characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

But when I use this class in stringr::str_replace_all(), it doesn't seem to match + characters.

library(stringr)
str_vec <- c("c++", "c--", "c+_")
str_replace_all(str_vec, pattern = "[[:punct:]]", replacement = "_")
# [1] "c++" "c__" "c+_"
str_replace_all(str_vec, pattern = "[[:punct:]]{2,}", replacement = "_")
# [1] "c++" "c_"  "c+_"

Does this have something to do with the locale settings?

Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8"

Or is it something else that I'm missing here?

steadyfish
  • This is not base R's regex, but has something to do with the regex that `stringr` uses: see `gsub("[[:punct:]]", "_", str_vec)`. – lmo May 05 '16 at 15:01
  • Perhaps this regex pattern is not part of `stringr`'s vocabulary. See `help("stringi-search-regex")` for the list of patterns. – lmo May 05 '16 at 15:08
  • Thanks @lmo. Looking at `help("stringi-search-charclass")`, I could see they are already warning about POSIX `[:punct:]` character class! `".. .. So a POSIX flavor of [:punct:] is more like [\p{P}\p{S}] in ICU. .. .. "` – steadyfish May 05 '16 at 15:38
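
Following the stringi help text quoted in the comment above, here is a minimal sketch of what the workaround might look like. It assumes that `[\p{P}\p{S}]` behaves as that help page describes on the installed ICU version; `icu_result` and `base_result` are illustrative names, not part of the original post.

```r
library(stringr)

str_vec <- c("c++", "c--", "c+_")

# ICU-style class: \p{P} (punctuation) plus \p{S} (symbols, which is where "+" falls)
icu_result <- str_replace_all(str_vec, pattern = "[\\p{P}\\p{S}]", replacement = "_")

# Base R regex for comparison, as suggested in the first comment
base_result <- gsub("[[:punct:]]", "_", str_vec)

print(icu_result)
print(base_result)
```

Both calls should replace every `+`, `-`, and `_`. If the two results differ on other input, the cause is more likely the regex engine (ICU in stringr versus base R) than the locale.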
