Let's say I have the following string in R:

text <- "[Peanut M&M\u0092s]"

I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:

replaced <- gsub("\\\\u0092", "", text )

However, the above doesn't seem to work: the result is identical to the original string. What is the correct way to do this removal?

Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?

Simon

1 Answer

You can use a [^[:ascii:]] bracket expression with a Perl-like (PCRE) regex to remove the non-ASCII characters from your input, and add the alternative [][] to match the square brackets as well:

text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=TRUE)
replaced
## => [1] "Peanut M&Ms"

If you only need to remove the \u0092 character and the square brackets, you do not need a Perl-like regex:

replaced <- gsub("[][\u0092]", "", text)
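As a rough illustration of why the pattern in the question matches nothing (a sketch only, with variable names made up for the example): R resolves "\u0092" into the single character U+0092 when it parses the string literal, so the string holds no backslash. The pattern "\\\\u0092", on the other hand, reaches the regex engine as \\u0092 and looks for a literal backslash followed by the five characters u0092.

```r
text <- "[Peanut M&M\u0092s]"

# The escape is resolved at parse time: the string contains one character
# with code point 146 (hex 92), not a backslash followed by "u0092".
code_point <- utf8ToInt("\u0092")

# "\\\\u0092" is the regex \\u0092: a literal backslash, then "u0092".
matches_actual_char <- grepl("\\\\u0092", text)

# It only matches text that really contains a backslash, such as "M&M\\u0092s".
matches_escaped_text <- grepl("\\\\u0092", "M&M\\u0092s")

# Matching the character itself works; fixed = TRUE skips regex interpretation.
only_apostrophe_removed <- gsub("\u0092", "", text, fixed = TRUE)
```

Here matches_actual_char should be FALSE and matches_escaped_text TRUE, which is consistent with the unchanged result reported in the question.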

Note that [...] is a character class that matches a single character: here, either ], [, or \u0092. If you place ] at the very beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
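The bracket-placement rule, and the question of one call versus separate calls, can be sketched as follows (a minimal example with illustrative variable names, not a benchmark):

```r
text <- "[Peanut M&M\u0092s]"

# "[][]" is a character class holding "]" (literal because it comes first) and "[".
brackets_removed <- gsub("[][]", "", "[abc]")

# One call: the brackets and U+0092 share a single character class.
one_call <- gsub("[][\u0092]", "", text)

# Separate calls: the string is scanned twice, once per pattern.
step <- gsub("\u0092", "", text, fixed = TRUE)
separate_calls <- gsub("[][]", "", step)
```

Both routes should give the same string. For a single short string the cost difference is negligible, so the one-call form is mainly a matter of brevity; on long vectors, scanning once is the cheaper option in principle.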

Wiktor Stribiżew
  • Hi, I've tried the first solution and do not get "Peanut M&Ms"; the result remains "Peanut M&M\u0092s". The second one solves it, though. Wiktor, could you check the results from your first solution one more time? I've run exactly the same script in R and got different results. – mkrasmus Feb 01 '19 at 01:12
  • Actually now I wonder if it has something to do with newer versions of R, given this question: https://stackoverflow.com/questions/36108790/trouble-with-strings-with-u0092-unicode-characters – mkrasmus Feb 01 '19 at 01:30
  • @mkrasmus Make sure the `T` is not redefined, use `perl=TRUE` always. Not sure what the root cause is. – Wiktor Stribiżew Feb 01 '19 at 16:30
  • Hi Wiktor, as noted, exactly the same command (including the argument perl = TRUE or perl=T) - different results i.e. replaced = "Peanut M&M\u0092s" – mkrasmus Feb 03 '19 at 23:34