
I am trying to replace all punctuation and non-word characters except "." and "-" in a string, but I am struggling to find the right regex.

I've been using the following str_replace_all() code in R, but now I want it to leave "." and "-" untouched. I've tried including things like [^.-] and ([.-]) in the pattern, but I'm not getting the desired output.

str_replace_all("[APPLE/O.ORANGE*PLUM-11]", regex("[\\W+,[:punct:]]", perl=T)," ")

" APPLE O ORANGE PLUM 11 " #current output

" APPLE O.ORANGE PLUM-11 " #desired output

Any suggestions would be greatly appreciated. Thanks!

SC2
  • Error: could not find function "str_replace_all". You should specify which packages you use when asking about non-base R functions. – IRTFM Feb 01 '17 at 16:33
  • See the [in R, use gsub to remove all punctuation except period](https://stackoverflow.com/a/60514641/3832970) answer with correct solutions. The currently accepted `[^a-zA-Z0-9.-]` will remove a lot of Unicode letters and numbers, not just punctuation. – Wiktor Stribiżew Mar 04 '20 at 09:13

2 Answers


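It's probably easier to use a negated character class: a leading ^ inside the brackets makes the class match any character that is not listed. By listing all letters, digits, ., and - inside the brackets, you leave those characters untouched and replace everything else. The - goes last so that it is read as a literal hyphen rather than as a range.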

library(stringr) 
str_replace_all("[APPLE/O.ORANGE*PLUM-11]", "[^a-zA-Z0-9.-]"," ")
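
As the comment under the question points out, the ASCII-only class also replaces non-English letters. A minimal sketch of a Unicode-aware variant, assuming stringr's default ICU engine and its `\p{L}` (any letter) and `\p{N}` (any number) properties; the variable names are illustrative only:

```r
library(stringr)

x <- "[APPLE/O.ORANGE*PLUM-11]"

# ASCII-only class, as in the answer above
ascii_clean <- str_replace_all(x, "[^a-zA-Z0-9.-]", " ")

# Unicode-aware variant: keep any letter (\p{L}), any number (\p{N}), "." and "-"
unicode_clean <- str_replace_all(x, "[^\\p{L}\\p{N}.-]", " ")

print(ascii_clean)
print(unicode_clean)
```

For pure ASCII input like this one, both calls should give the same result; they only differ once the string contains accented or non-Latin letters.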
joel.wilson
be_green

Note that str_replace_all does not accept PCRE patterns: the stringr library is powered by the ICU regex engine.

What you need to do can be done with a base R gsub using the following pattern:

> x<-"[APPLE/O.ORANGE*PLUM-11]"
> gsub("[^\\w.-]", " ", x, perl=TRUE)
[1] " APPLE O.ORANGE PLUM-11 "

See the R demo online. Also, see the regex online demo here.

Since [^...] is a negated character class, the [^\\w.-] pattern matches any character other than a word character (letter, digit, or _), ., or -.
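
One consequence of using \\w is that underscores are kept, because _ counts as a word character. A small sketch in base R, assuming you might also want _ replaced; the input string and variable names here are made up for illustration:

```r
# Hypothetical input containing an underscore
x <- "[APPLE/O.ORANGE*PLUM_11-A]"

# "_" is a word character, so the negated class leaves it alone
keep_underscore <- gsub("[^\\w.-]", " ", x, perl = TRUE)

# Adding "_" as an alternative replaces it as well
drop_underscore <- gsub("[^\\w.-]|_", " ", x, perl = TRUE)

print(keep_underscore)  # " APPLE O.ORANGE PLUM_11-A "
print(drop_underscore)  # " APPLE O.ORANGE PLUM 11-A "
```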

Wiktor Stribiżew