I have some sentences like this one.

c = "In Acid-base reaction (page[4]), why does it create water and not H+?" 

I want to remove all special characters except for these: ' ? & + - /

I know that if I want to remove all special characters, I can simply use

gsub("[[:punct:]]", "", c)
"In Acidbase reaction page4 why does it create water and not H"

However, this also removes characters such as + - ? that I want to keep.

I tried to create a string of special characters that I can use in some code like this

gsub("[special_string]", "", c)

The best I can do is to come up with this

cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")

However, the following code just won't work

gsub("[cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")]", "", c)

What should I do to remove special characters, except for a few that I want to keep?

Thanks

wen

3 Answers

gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
BrodieG
  • This really works. I only know that ^ marks the beginning of a line (and $ marks the end). Why are you using it to mean "keep"? Could you explain a little? – wen Feb 08 '14 at 04:00
  • "^" is the character class negation marker (when it occurs first). Read `?regex`. – IRTFM Feb 08 '14 at 04:43
  • @user3193265, as IShouldBuyABoat notes, the `^` inside a character class (`[]`) has a different meaning than outside. Several other characters have different meanings too. For example, `?+` are not special characters in character classes, but `-` is (so we had to escape that one). Inside, as the first character, `^` means negate, i.e. match everything other than what's inside the expression. If this answers your question, please consider checking it as answered. Thanks. – BrodieG Feb 08 '14 at 13:05
  • Does not seem to work for all special characters, like '•'. They seem to survive... – Fabian Werner Aug 27 '15 at 09:38
  • To also keep ".", just add `\\.` inside the character class: `gsub("[^[:alnum:][:blank:]+?&/\\-\\.]", "", c)`. (The `\\` is an escape character. Inside a character class a plain `.` is already literal rather than a wildcard matching any character, so the escape is optional but harmless.) – Matt L. Sep 23 '22 at 13:35
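
To make the point from the comments above concrete, here is a minimal sketch contrasting `^` as an anchor with `^` as the negation marker at the start of a bracket expression. The string `s` and the variable names are illustrative assumptions, not part of the original question.

```r
s <- "a+b? (c)"

# Outside brackets, ^ anchors the match to the start of the string,
# so only the leading "a" is removed.
anchored <- sub("^a", "", s)

# Inside brackets and in first position, ^ negates the class:
# every character that is NOT a letter, a space, + or ? is removed.
negated <- gsub("[^[:alpha:] +?]", "", s)

print(anchored)  # "+b? (c)"
print(negated)   # "a+b? c"
```

Note that `+` and `?` need no escaping inside the brackets, which is why they can simply be listed among the characters to keep.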

In order to get your method to work, you need to put the literal "]" immediately after the leading "["

 gsub("[][!#$%()*,.:;<=>@^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"

You can then put the inner "[" anywhere. If you also needed to strip the minus sign, it would have to come last inside the brackets. See the part of the ?regex help page that follows the list of predefined character classes.
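
The original goal of keeping the unwanted characters in a separate string can be met by splicing that string into the brackets with `paste0()`. The following is only a sketch under that assumption: the names `txt`, `to_remove` and `pattern` are made up for illustration, and the character list is one possible ordering that respects the placement rules described above.

```r
txt <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Characters to remove. "]" comes first so it is read as a literal;
# "[" can then appear anywhere after it.
to_remove <- "][!#$%()*,.:;<=>@^_`{|}~"

# Splice the string into a bracket expression.
pattern <- paste0("[", to_remove, "]")
cleaned <- gsub(pattern, "", txt)
print(cleaned)

# To strip "-" as well, append it as the LAST character so that it is
# read as a literal minus rather than as a range operator.
pattern_minus <- paste0("[", to_remove, "-", "]")
cleaned_no_minus <- gsub(pattern_minus, "", txt)
print(cleaned_no_minus)
```

The first call should leave `+`, `-` and `?` untouched, while the second additionally turns "Acid-base" into "Acidbase".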

IRTFM

I think you're after a regex solution. I'll give you a messy regex solution and an add-on package solution (shameless self-promotion).

There's likely a better regex:

x <- "In Acid-base reaction (page[4]), why does it create water and not H+?" 
keeps <- c("+", "-", "?")

## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\", 
    keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)

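## qdap: add-on package solution
library(qdap)
strip(x, keeps, lower = FALSE)

## [1] "In Acid-base reaction page why does it create water and not H+?"
## (output of strip(), which appears to drop digits by default, hence "page" rather than "page4")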
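
Since the nested `paste0()` call above is hard to read, the sketch below only assembles the pattern and prints it, so the resulting regular expression can be inspected. The intermediate names `escaped`, `alternatives` and `pattern` are illustrative assumptions, not part of the original answer.

```r
keeps <- c("+", "-", "?")

# Prefix each kept character with a backslash so it is treated literally.
escaped <- paste0("\\", keeps)

# Join the escaped characters into one alternation.
alternatives <- paste(escaped, collapse = "|")

# Lazily skip characters, capture the next thing worth keeping (end of
# string, apostrophe, a kept character, or any non-punctuation character),
# and let gsub() put back only the captured part via "\\1".
pattern <- paste0(".*?($|'|", alternatives, "|[^[:punct:]]).*?")

writeLines(pattern)
# .*?($|'|\+|\-|\?|[^[:punct:]]).*?
```

Adding another character to `keeps` extends the alternation without touching the rest of the pattern.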
Tyler Rinker