R - Regex to remove foreign unicode characters

Question

I want to use R to perform some analytics on Twitter posts, such as this Tweet by Donald Trump (pulled via the Twitter API):

"Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

First, I would like to know if there is a regular expression that I can use to select the escaped Unicode (e.g., \U0001f1f8).

Expressions that I assumed would work, such as \\[[:alnum:]]{9}, do not. I did get an interesting error message, however:

Error in grepl("\[[:alnum:]]{9}", x, perl = T) :
  invalid regular expression '[[:alnum:]]{9}'
In addition: Warning message:
In grepl("\[[:alnum:]]{9}", x, perl = T) :
  PCRE pattern compilation error
  'POSIX named classes are supported only within a class'
  at '[:alnum:]]{9}'

Also, I'd like to know if there is a way to convert these escaped Unicode sequences back into the characters they represent, so I can display them to the user on the front end of the application.

score 2 · Answer 1 · answered Nov 19 '17 at 20:48

You can do this with iconv. Converting to ASCII with sub="" drops every non-ASCII character, which includes the Unicode flag characters in your tweet.

teststring <- "Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

iconv(teststring, "latin1", "ASCII", sub="")
#[1] "Join me LIVE in South Korea\n#NationalAssembly #POTUSinAsia"
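
If you would rather stay with a regular expression, here is a minimal sketch of one alternative. It assumes the string holds real Unicode characters and that \U0001f1fa is only how R prints them, so there is no literal backslash for a pattern like \\[[:alnum:]]{9} to match. The variable names (x, ascii_only, non_ascii) are mine and used only for illustration.

```r
# The tweet from the question. Each \UXXXXXXXX escape is parsed by R into
# a single Unicode character, not a backslash followed by nine characters.
x <- "Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

# Check whether a literal backslash is stored in the string.
has_backslash <- grepl("\\", x, fixed = TRUE)

# Remove every character outside the ASCII range with a PCRE character class.
ascii_only <- gsub("[^\\x01-\\x7F]", "", x, perl = TRUE)

# Or select the non-ASCII characters instead of deleting them.
non_ascii <- regmatches(x, gregexpr("[^\\x01-\\x7F]", x, perl = TRUE))[[1]]

# Inspect their code points, e.g. for logging or for mapping to images.
code_points <- sprintf("U+%04X", utf8ToInt(paste(non_ascii, collapse = "")))

# The characters are already "converted": writing the string to a
# UTF-8 capable console or file emits the characters themselves.
cat(x, "\n")
```

Whether cat() actually renders the flags depends on the console, locale, and font, so for a front end it is probably safer to pass the UTF-8 string through unchanged and let the browser render it.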
