I'm trying to remove some VERY special characters from my strings. I've read other posts, such as:

  1. Remove all special characters from a string in R?
  2. How to remove special characters from a string?

but these are not what I'm looking for.

Let's say my string is as follows:

s = "who are í ½í¸€ bringing?"

I've tried the following:

test = tm_map(s, function(x) iconv(enc2utf8(x), sub = "byte"))
test = iconv(s, 'UTF-8', 'ASCII')

Neither of the above worked.

Edit: I am looking for a GENERAL solution! I cannot (and would prefer not to) manually identify all the special characters.

Also, these VERY special characters MAY (I'm not 100% sure) be the result of emoticons.

Please help or point me to the right posts. Thank you!

alwaysaskingquestions

1 Answer

So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:

> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub replaces every match of rmSpec with "".
> s.rem
[1] "who are  ¸ bringing?"

Now, this does have the caveat that you have to manually define the special characters in the rmSpec variable. I'm not sure whether you know which special characters to remove or whether you're looking for a more general solution.

EDIT:

So it appears you almost had it with iconv; you were just missing the sub argument, which sets the replacement for any bytes that cannot be converted (here, the empty string). See below:

> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are   bringing?"
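
For a whole vector of strings rather than a single one, the iconv call above can be wrapped in a small helper. This is only a sketch: `strip_non_ascii` is a made-up name, not part of any package, and it assumes the input is UTF-8 and that collapsing the leftover runs of spaces is wanted.

```r
# Hypothetical helper (not from any package): drop every non-ASCII byte,
# then tidy up the whitespace that the removed characters leave behind.
strip_non_ascii <- function(x) {
  ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # sub = "" deletes unconvertible bytes
  ascii <- gsub("[[:space:]]+", " ", ascii)                  # collapse runs of whitespace
  trimws(ascii)                                              # strip leading/trailing whitespace
}

# \u escapes keep the script itself pure ASCII; the first element is the string from the question.
s <- c("who are \u00ed \u00bd\u00ed\u00b8\u20ac bringing?", "plain text")
cleaned <- strip_non_ascii(s)
print(cleaned)
```

Strings that are already plain ASCII pass through unchanged, so the helper should be safe to apply to an entire column without first finding out which rows contain special characters.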
giraffehere
  • Yeah... I'm looking for a more general solution. Manually identifying all the special characters is almost impossible in my case (I have a very large data set and would prefer not to go through it one by one to find out which special characters it contains) – alwaysaskingquestions Feb 25 '16 at 22:24
  • @alwaysaskingquestions See my edit in my answer. There was an additional argument in `iconv` you were missing. – giraffehere Feb 25 '16 at 22:36
  • You can also specify a group of characters to be replaced like `gsub("[í½€¸]","",s)`, which is simpler than using `|` multiple times – thelatemail Feb 25 '16 at 22:44
  • @thelatemail I always forget this. Thanks for the addition! – giraffehere Feb 25 '16 at 22:53
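
Following up on the character-class idea from the comments: a class over the same characters gives the same result as the `|` version, and negating a class of printable ASCII avoids listing the characters at all. A rough sketch, assuming tabs and newlines can be dropped too (the range below excludes them); the variable names are illustrative only.

```r
# String from the question, written with \u escapes so the script stays pure ASCII.
s <- "who are \u00ed \u00bd\u00ed\u00b8\u20ac bringing?"

by_alt   <- gsub("\u00ed|\u00bd|\u20ac", "", s)  # alternation, as in the answer
by_class <- gsub("[\u00ed\u00bd\u20ac]", "", s)  # the same three characters as a class
print(identical(by_alt, by_class))

# Negated class: remove anything that is not printable ASCII (space through ~).
ascii_only <- gsub("[^\\x20-\\x7E]", "", s, perl = TRUE)
print(ascii_only)
```

For this string the negated class leaves the same text as `iconv(s, "UTF-8", "ASCII", sub = "")`, so either route works as the general solution; the regex version just makes the set of characters that are kept explicit.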