
I am having trouble with [:punct:] in R regex. To my mind pat1 and pat2, on this narrow example, should produce identical results.

library(stringr)

test <- " <= 17 "

pat1 <- "[[:punct:]]+"
str_extract(test, pat1)
# [1] NA

pat2 <- "[[\\=\\<\\>]]+"
str_extract(test, pat2)
# [1] "<="

I've tested this on three separate R installations, one of which is a formally-managed corporate environment. I see the same situation in all three.

Am I misunderstanding how [:punct:] should be used? Or is there possibly a bug?

MrFlick
GregA
  • The values "=" and "<" aren't considered "punctuation". That class just contains `!"#%&'()*,-./:;?@[]_{}` – MrFlick Jan 25 '20 at 21:20
  • @MrFlick that disagrees with the listing at `base::regex` and the behavior of `grep` and `str_extract` differ on `< = > ^ $ + | ~` and backtick. – Brian Jan 25 '20 at 21:23
  • @Brian Well, if you want base regex behavior, use base regex. The base behavior uses its own engine. The stringr package uses a different engine. The dup already talks about the differences. Not sure what you are trying to say. – MrFlick Jan 25 '20 at 21:26
  • @MrFlick it does but I think it isn't clear enough, especially since it talks about `stringi` (the underlying engine) and not `stringr` (the implementation most users see). – Brian Jan 25 '20 at 21:27
  • And I think that it isn't clear from the `stringr` team either, since their cheat sheet lists the base-R punctuation behavior! So an issue needs to be filed there. https://github.com/rstudio/cheatsheets/blob/master/strings.pdf – Brian Jan 25 '20 at 21:32
  • @Brian The underlying regex engine used in `stringi` or `stringr` is **ICU**, and it follows the [Annex C: Compatibility Properties](http://www.unicode.org/reports/tr18/#Compatibility_Properties) recommendations, i.e. `[:punct:]` matches `\p{gc=Punctuation}` (punctuation proper) while most regex implementations that use POSIX character classes handle them the POSIX compliant way (`[\p{gc=Punctuation}\p{gc=Symbol}]`). – Wiktor Stribiżew Jan 25 '20 at 21:48
  • Thanks for responses. I was using the RStudio cheat sheet, which explicitly states that these characters are included... so it does look like there's at least a documentation issue. [https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf) – GregA Jan 25 '20 at 22:03
  • For completeness of the post, for anyone who encounters this in future, simplest fix in a `stringr` context is `pat <- "[[:punct:][:symbol:]]+"` – GregA Jan 26 '20 at 01:03
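
To make the comments above concrete, here is a minimal sketch that runs the same test string through `stringr` (ICU engine) and through base R's default regex engine. The variable names (`icu_punct`, `icu_punct_symbol`, `base_punct`) are only illustrative, and the base R result assumes the behavior documented in `?regex`, where `[:punct:]` includes `<`, `=` and `>`.

```r
library(stringr)

test <- " <= 17 "

# stringr (ICU engine): [:punct:] covers only Unicode punctuation,
# so the symbols "<" and "=" are not matched and the result is NA.
icu_punct <- str_extract(test, "[[:punct:]]+")

# Adding the symbol category, as suggested in the last comment,
# lets the ICU engine match "<=" as well.
icu_punct_symbol <- str_extract(test, "[[:punct:][:symbol:]]+")

# Base R for comparison: the default engine treats [:punct:] the
# POSIX way, so it is expected to match "<=" directly.
m <- regexpr("[[:punct:]]+", test)
base_punct <- regmatches(test, m)

print(icu_punct)
print(icu_punct_symbol)
print(base_punct)
```

If the same pattern has to behave identically in both engines, spelling out the wanted characters explicitly (as `pat2` in the question does) avoids depending on either definition of `[:punct:]`.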
