I have a value, mystring, defined below:

mystring <- "! \" # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~"

When I tried to extract all punctuation using the str_extract_all function from stringr, some punctuation characters, such as $ and +, were not extracted. I tried escaping them with a backslash, but that gave an error instead.

str_extract_all(mystring, pattern = "[[:punct:]]")
# [[1]]
#  [1] "!"  "\"" "#"  "%"  "&"  "'"  "("  ")"  "*"  ","  "-"  "."  "/"  ":"  ";"
# [16] "?"  "@"  "["  "]"  "_"  "{"  "}"

It works with base grep, though:

grep(pattern = "[[:punct:]]", unlist(strsplit(mystring," ")), value = TRUE)
# [1] "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "."  "/"  ":"  ";"  "<"  "="  ">"  "?"  "@" 
# [23] "["  "]"  "^"  "_"  "`"  "{"  "|"  "}"  "~" 
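
For comparison, the extraction can also be done in base R without splitting the string first. This is only a sketch: `regmatches()` and `gregexpr()` are standard base functions, but `perl = TRUE` is my own addition (it selects PCRE, whose `[[:punct:]]` covers the ASCII punctuation characters), and the string below leaves out the backslash to keep the example simple.

```r
# Same characters as in the question, minus the backslash
mystring <- "! \" # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~"

# gregexpr() finds the position of every match; regmatches() pulls them out
hits <- gregexpr("[[:punct:]]", mystring, perl = TRUE)
base_punct <- regmatches(mystring, hits)[[1]]

# Number of characters extracted
length(base_punct)
```

Unlike the grep call, this returns the matched characters themselves rather than filtering a pre-split vector.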

Is this a bug in stringr or is there something wrong with my code?

mt1022
HNSKD
    `[[:punct:]]` is affected by locale. Try `stri_extract_all(mystring, regex = "[\\p{P}\\p{S}]")` instead. See this thread for more: https://stackoverflow.com/questions/26348643/r-regex-with-stringi-icu-why-is-a-considered-a-non-punct-character – mt1022 Jul 13 '17 at 04:15
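
A minimal sketch of the suggestion in this comment, using stringr directly (it is built on stringi and its ICU regex engine). In ICU, `[[:punct:]]` corresponds to `\p{P}` only, while characters such as `$`, `+`, and `<` are classed as symbols (`\p{S}`), so the character class has to include both. The variable names `all_punct` and `only_symbols` are hypothetical, and the string omits the backslash from the question for simplicity.

```r
library(stringr)

# Same characters as in the question, minus the backslash
mystring <- "! \" # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~"

# \p{P} = punctuation, \p{S} = symbols (currency, math, modifier)
all_punct <- str_extract_all(mystring, "[\\p{P}\\p{S}]")[[1]]

# The characters ICU treats as symbols rather than punctuation,
# i.e. the ones "[[:punct:]]" misses in stringr
only_symbols <- str_extract_all(mystring, "\\p{S}")[[1]]
```

Comparing `only_symbols` with the stringr output in the question should show which characters were being skipped.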

0 Answers