
I want to use R to perform some analytics on Twitter posts, such as this Tweet by Donald Trump (pulled via the Twitter API):

"Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

First, I would like to know if there is a regular expression that I can use to select the escaped Unicode sequences (e.g. \U0001f1f8).

Patterns that I expected to work, such as \\[[:alnum:]]{9}, do not. They do, however, produce an interesting error message:

Error in grepl("\[[:alnum:]]{9}", x, perl = T) :
  invalid regular expression '[[:alnum:]]{9}'
In addition: Warning message:
In grepl("\[[:alnum:]]{9}", x, perl = T) :
  PCRE pattern compilation error
  'POSIX named classes are supported only within a class'
  at '[:alnum:]]{9}'

Also, I'd like to know if there is a way to convert these escaped Unicode sequences back into the characters they represent, so I can display them to the user on the front end of the application.

Christopher Costello

1 Answer


You can do this with iconv. Converting to ASCII with sub = "" drops every non-ASCII character, which includes the Unicode characters in your tweet.

teststring <- "Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

iconv(teststring, "latin1", "ASCII", sub="")
#[1] "Join me LIVE in South Korea\n#NationalAssembly #POTUSinAsia"
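
If you want to select the non-ASCII characters with a regular expression rather than only delete them, something along these lines may work. This is a minimal sketch, assuming the tweet is held as a UTF-8 string as in the example above; the names x and non_ascii are only illustrative. Two things are worth noting. The pattern in the question fails because \[ is a literal bracket, which leaves [:alnum:] outside a bracket expression. And the string contains the actual characters, not backslash text, so a pattern that looks for a literal \U finds nothing.

```r
# "\U0001f1fa" is only how R writes the character; the string itself
# contains no backslash.
x <- "Join me LIVE in South Korea\U0001f1fa\U0001f1f8\U0001f1f0\U0001f1f7\n#NationalAssembly #POTUSinAsia"

# A syntactically valid pattern for a literal backslash, "U" and eight hex
# digits compiles, but does not match this string.
literal_escape_found <- grepl("\\\\U[[:xdigit:]]{8}", x, perl = TRUE)

# Match any run of characters outside the ASCII range instead.
non_ascii <- "[^\\x01-\\x7F]+"

has_non_ascii <- grepl(non_ascii, x, perl = TRUE)

# Extract the matched runs (here, the run of four regional indicator symbols).
matches <- regmatches(x, gregexpr(non_ascii, x, perl = TRUE))[[1]]

# Or remove them, which gives the same result as the iconv() call above.
stripped <- gsub(non_ascii, "", x, perl = TRUE)
```

Here regmatches() keeps the characters available for further processing, while gsub() is the regex equivalent of the iconv() approach.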
Santosh M.
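
On converting the escapes back into characters: as far as I can tell no conversion is needed, because \U0001f1fa is only R's printed representation of a character that is already in the string. The sketch below, which uses illustrative variable names and assumes a UTF-8 string, inspects the code points with utf8ToInt() and rebuilds the string with intToUtf8(). Whether the glyphs actually render depends on the console, locale and fonts, so treat the writeLines() call as something to try rather than a guarantee.

```r
# Two regional indicator symbols (together they form the US flag).
flag <- "\U0001f1fa\U0001f1f8"

# Integer code points of each character.
code_points <- utf8ToInt(flag)

# Human-readable labels such as "U+1F1FA".
labels <- sprintf("U+%04X", code_points)

# Rebuild the string from its code points.
round_trip <- intToUtf8(code_points)

# print() may show escapes; writeLines() writes the characters themselves.
writeLines(flag)
```

If the front end receives the string as UTF-8 (for example through JSON), it should be able to display the characters without any extra decoding step on the R side.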