Does stringr use a different set of character classes for punctuation?

Question

I was using the [:punct:] regular expression character class, and it seems that the stringr package does not define [:punct:] the same way that base R's grepl does.

> grepl('[[:punct:]]', '^HELLO')
[1] TRUE
> str_detect('^HELLO', '[[:punct:]]')
[1] FALSE

stringr and grepl agree on some of the basic punctuation marks (such as the comma, period, and question mark):

> grepl('[[:punct:]]', '?HELLO')
[1] TRUE
> str_detect('?HELLO', '[[:punct:]]')
[1] TRUE

But they disagree on others, such as `, ~ and |. A fuller test of [:punct:] is below. I have not tested other character classes, so I am unsure whether this is limited to [:punct:].

library(stringr)
punct <- c(
  ".", ",", ":", ";", "?", "!", "\\", "|", "/", "`", "=","*", "+", "-", "^",
  "_", "~", "\"", "'", "[", "]", "{", "}", "(", ")", "<", ">", "@", "#", "$"
  )
grepl("[[:punct:]]", punct)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [29] TRUE TRUE
str_detect(punct, "[:punct:]")
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
#> [12]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [23]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
punct[which(!str_detect(punct, "[:punct:]"))]
#> [1] "|" "`" "=" "+" "^" "~" "<" ">" "$"

Created on 2018-05-03 by the reprex package (v0.2.0).

It might just be that `str_detect` is treating `^` as regex. If you escape it `str_detect('\\^HELLO', '[[:punct:]]')` yields true — camille, May 03 '18 at 19:21
@camille if you run `str_extract_all('\\^HELLO','[[:punct:]]')` you'll get \\ - it's detecting the escaped \, not the ^ — Mark, May 03 '18 at 19:24
`stringr` runs `stringi` under the hood, and I think stringi uses a different regex library than base R. stringi uses [ICU](http://userguide.icu-project.org/strings/regexp) while base R uses [TRE](https://laurikari.net/tre/documentation/regex-syntax/), but I know little about regex so I wouldn't take my word for it — David Arenburg, May 03 '18 at 19:32

score 1 · Answer 1 · answered May 03 '18 at 19:37

I don't know why, but we can explore how widespread the difference is. We can generate the set of printable ASCII characters (codes 33 to 126).

rawToChar(as.raw(33:126))
#> [1] "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

Now we split these into individual characters and use sapply to run grepl and str_detect on each one.

testCase = strsplit(rawToChar(as.raw(33:126)),'')[[1]]
base = sapply(testCase,grepl,pattern="[[:punct:]]")
stringr = sapply(testCase,stringr::str_detect,pattern="[[:punct:]]")

base_punct = names(base)[base]
stringr_punct = names(stringr)[stringr]

setdiff(base_punct,stringr_punct)
#> [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"
setdiff(stringr_punct,base_punct)
#> character(0)

So there are 9 punctuation characters that grepl detects but stringr does not. There are none that stringr detects but grepl does not. Toggling perl=TRUE in grepl has no impact on the result.
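One way to see what those nine characters have in common is to look at their Unicode general category. The sketch below uses only base R with perl = TRUE, and it assumes the PCRE library behind R was built with Unicode property support (pcre_config() reports this). The variable names are my own. The background assumption, taken from the ICU documentation linked in the comments rather than verified here, is that ICU treats [:punct:] as the Unicode Punctuation category (\p{P}).

```r
# The nine characters that grepl counts as punctuation but stringr does not
missed <- c("$", "+", "<", "=", ">", "^", "`", "|", "~")

# With perl = TRUE, grepl uses PCRE, which understands Unicode property escapes.
# Assumption: R's PCRE has Unicode property support (check pcre_config()).
is_symbol <- grepl("^\\p{S}$", missed, perl = TRUE)  # general category S (Symbol)
is_punct  <- grepl("^\\p{P}$", missed, perl = TRUE)  # general category P (Punctuation)

# Show the classification side by side
print(data.frame(char = missed, is_symbol = is_symbol, is_punct = is_punct))
```

If every row is a Symbol and none is Punctuation, that is consistent with the difference coming from Unicode categories rather than from anything being interpreted as regex syntax.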

The case you found looked as though the ^ might be getting interpreted as regex syntax, but the fact that ( ) [ ] and - are detected is evidence against that.
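If the goal is to get grepl's ASCII behaviour out of stringr, one possible workaround is to match both the Unicode Punctuation and Symbol categories. This is only a sketch: the pattern [\\p{P}\\p{S}] and the variable names are my own choices, not something taken from the stringr documentation, and behaviour outside the printable ASCII range is not checked here.

```r
library(stringr)

# Printable ASCII characters, one per element
ascii <- strsplit(rawToChar(as.raw(33:126)), "")[[1]]

# What base R considers punctuation (both functions are vectorised)
base_punct <- ascii[grepl("[[:punct:]]", ascii)]

# Assumed workaround: Punctuation (P) or Symbol (S) general category in ICU
icu_matches <- ascii[str_detect(ascii, "[\\p{P}\\p{S}]")]

# Compare the two sets; both differences should be empty if the workaround holds
print(setdiff(base_punct, icu_matches))
print(setdiff(icu_matches, base_punct))
```

Because grepl and str_detect are both vectorised, the sapply calls are not needed for this comparison; plain logical indexing keeps the characters themselves.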

[It has all been explained already](https://stackoverflow.com/questions/26348643/r-regex-with-stringi-icu-why-is-a-considered-a-non-punct-character). — Wiktor Stribiżew, May 03 '18 at 19:42
