I'd like to use R's gsub to remove all punctuation from a text except for apostrophes. I'm fairly new to regex but am learning.

Example:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))

Current Output (no apostrophe in don't)

[1] "I like to chew gum but dont like bubble gum"

Desired Output (I desire the apostrophe in don't to stay)

[1] "I like to chew gum but don't like bubble gum"
zx8754
Tyler Rinker

4 Answers

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)

[1] "I like to chew gum but don't like bubble gum"

The regex above is much more straightforward. It replaces every character that is not alphanumeric, whitespace, or an apostrophe with an empty string. The caret at the start of the bracket expression is what negates the class.

Kay
  • Kay your code does remove the apostrophe. This is what I think you meant `gsub("[^[:alnum:][:space:]'\"]", "", x)` – Tyler Rinker Jan 02 '12 at 07:26
  • I like how straight forward this coding is. – Tyler Rinker Jan 02 '12 at 07:31
  • +1 -- The idea here looks to be the clearest possible solution, in my opinion. Just edit the second line to read `gsub("[^[:alnum:][:space:]']", "", x)` and it's golden. (FWIW, the backslash isn't needed inside the regex.) – Josh O'Brien Jan 02 '12 at 07:56
  • of course this answer gets out of whack if your text contains non-ascii characters (e.g. text in multiple scripts) – MichaelChirico May 06 '18 at 02:02
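
To illustrate the non-ASCII caveat raised in the last comment: whether `[:alnum:]` covers accented or non-Latin letters can depend on the locale. One possible workaround, sketched here under the assumption that R's PCRE build has Unicode property support, is to use the Unicode classes `\p{L}` (letters) and `\p{N}` (numbers) with `perl = TRUE`. The sample string `y` and the name `clean` are made up for this illustration and are not from the original thread.

```r
# Hypothetical sample text with accented letters (not from the question)
y <- "Caf\u00e9 cr\u00e8me, na\u00efve; don't!"

# Keep Unicode letters (\p{L}), Unicode numbers (\p{N}), whitespace (\s)
# and the apostrophe; drop everything else.
# Assumption: PCRE was built with Unicode property support.
clean <- gsub("[^\\p{L}\\p{N}\\s']", "", y, perl = TRUE)
print(clean)
```

This keeps the same "negate what you want to keep" idea as the answer above and only swaps the POSIX classes for Unicode ones.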

You can exclude apostrophes from the POSIX class punct using a double negative:

[^'[:^punct:]]

Code:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)  # TRUE rather than T, since T can be reassigned

#[1] "I like to chew gum but don't like bubble gum"
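
If the double negative is hard to read, a negative lookahead is another way to express "any punctuation character except an apostrophe". This is a sketch of an alternative technique, not the approach used in the answer above; it also needs `perl = TRUE`, and the name `result` is only illustrative.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# (?!') asserts that the next character is not an apostrophe;
# [[:punct:]] then consumes one punctuation character.
result <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)
print(result)
#[1] "I like to chew gum but don't like bubble gum"
```

Further exceptions can be added inside the lookahead, for example `(?!['-])` to also keep hyphens.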

Mariano

Here is an example:

gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
kohske
  • Exactly what I was hoping for. Way more complicated than I thought it would be. No wonder I was having trouble. I'll really pull apart what you did. Thank you. – Tyler Rinker Jan 02 '12 at 03:49
  • Finally, this would be the simplest way: `gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x)`. – kohske Jan 02 '12 at 03:53
  • Thank you for the follow up. It works as well as the first and is simpler to follow. +1 – Tyler Rinker Jan 02 '12 at 04:13

Mostly for variety, here's a solution using gsubfn() from the terrific package of the same name. In this application, I like how expressive the resulting solution is:

library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
       replacement = function(x) ifelse(x == "'", "'", ""), 
       x)
[1] "I like to chew gum but don't like bubble gum"

(The argument engine = "R" is needed here as otherwise the default tcl engine will be used. Its rules for matching regular expressions are slightly different: if it were used to process the string above, for instance, one would need to instead set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out that detail.)
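
For readers without gsubfn installed, the same "decide per match" idea can be sketched in base R with `gregexpr()` and `regmatches<-`. This is an illustrative alternative rather than part of the gsubfn approach above; the helper name `keep_apostrophe` is made up for the example.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Hypothetical helper: blank out every matched character except "'"
keep_apostrophe <- function(p) {
  p[p != "'"] <- ""
  p
}

# Locate every punctuation character, then replace each match in place
m <- gregexpr("[[:punct:]]", x)
out <- x
regmatches(out, m) <- lapply(regmatches(out, m), keep_apostrophe)
print(out)
#[1] "I like to chew gum but don't like bubble gum"
```

Because this uses R's default regex engine, `[[:punct:]]` behaves as it does in the plain `gsub()` call from the question.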

Josh O'Brien
  • One caveat -- for some reason, the character class `[:punct:]`, when used in the `pattern` argument of a `gsubfn()` call, does not match the characters `$`, `|`, or `^` as it would in a call to `gsub()`. I thus had to add them 'by hand'. – Josh O'Brien Jan 02 '12 at 05:48
  • `gsubfn` uses tcl regular expressions by default. Use the argument `engine = "R"` if you wish to use R regular expressions. – G. Grothendieck Jan 13 '12 at 15:00
  • @G.Grothendieck -- Thanks for pointing that out. I've incorporated it in my answer. I had taken the documentation in `?gsubfn`, which states that `pattern: Same as 'pattern' in 'gsub'`, to mean that the patterns should be specified in the same way. Now I see what was meant by that, but wonder whether an additional line there might help. Something like `If engine="R", character strings will be matched as documented by 'help(regex)'. If the default tcl engine is used, patterns will be matched as documented at ...`. In any case, thanks for your work on the package! – Josh O'Brien Jan 15 '12 at 22:55