1

I'm trying to remove some odd tokens (like <U+0001F399><U+FE0F>) from my text corpus so I can do some Twitter analysis.

There are many tokens like that which I just can't remove using tm_map(X, removeWords).

I have plenty of tweets aggregated in a dataset. Then I use the following code:

corpus_tweets <- tm_map(corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))

If I swap those odd tokens for regular words (like "life" or "animal") that also appear in my dataset, the regular words get removed without any problem.

Any idea of how to solve this?

Darren Cook

2 Answers

0

As these are Unicode characters, you need to figure out how to properly enter them in R.

The escape code syntax for Unicode in R is probably not <U+xxxx>, but rather something like \Uxxxx. See the manual for details. (I don't use R; I am too annoyed by its inconsistencies. This is even an example of such an inconsistency, where the string is apparently printed differently from what R would accept as input.)

Has QUIT--Anony-Mousse
0
corpus_tweets <- tm_map(corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD", "\uFE0F", "\uFE0E"))

NOTE: Use a backslash and a lowercase u followed by 4 hex digits to specify a character from Unicode plane 0; use an uppercase U followed by up to 8 hex digits for the other planes (which is where most emoji live, as you would expect when working with tweets).
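To check that an escape really produces the character you expect, you can look at its code point in base R. This is only a small sketch; the variable names are made up for illustration.

```r
# "\U" form: needed for code points outside plane 0 (above U+FFFF)
rose <- "\U0001F339"
# "\u" form: enough for plane 0 (U+0000 to U+FFFF)
selector <- "\uFE0F"

# utf8ToInt() returns the integer code point of a single character
rose_cp <- utf8ToInt(rose)
selector_cp <- utf8ToInt(selector)

# Format the code points the way they show up in the printed <U+...> tokens
rose_label <- sprintf("U+%04X", rose_cp)
selector_label <- sprintf("U+%04X", selector_cp)

print(c(rose_label, selector_label))
```

Comparing these labels with the tokens printed for your corpus is a quick way to confirm you typed the right escape.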

BTW, see the question "Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it?" for why you are getting the FE0F in there: it is a variation selector, added when the emoji (colour) presentation of a character is wanted. FE0E is its partner (requesting the plain text glyph).
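If removeWords still leaves some of these behind (as far as I know it matches on word boundaries, which may not behave as expected for symbols), a plain substitution is an alternative. The following is a minimal base R sketch, not a tm recipe: the sample tweets and the helper names strip_chars and strip_literal_tokens are hypothetical. The second helper assumes your text holds the literal characters "<U+0001F339>" rather than the emoji itself, which is what the question's output suggests might be the case.

```r
# Hypothetical sample tweets; the second ends in U+2601 followed by U+FE0F
tweets <- c("life is a \U0001F339 garden", "cloudy today \u2601\uFE0F")

# Characters to strip: the two emoji from the question plus both variation selectors
unwanted <- c("\U0001F339", "\U0001F4CD", "\uFE0F", "\uFE0E")

# Remove each character literally (fixed = TRUE means no regex interpretation)
strip_chars <- function(x, chars) {
  for (ch in chars) {
    x <- gsub(ch, "", x, fixed = TRUE)
  }
  x
}

# Remove tokens stored as the literal text "<U+XXXX>" (assumption: 4 to 8 hex digits)
strip_literal_tokens <- function(x) {
  gsub("<U\\+[0-9A-Fa-f]{4,8}>", "", x)
}

cleaned <- strip_chars(tweets, unwanted)
# Collapse the double spaces left behind and trim the ends
cleaned <- trimws(gsub(" +", " ", cleaned))

literal_cleaned <- strip_literal_tokens("hi <U+0001F339><U+FE0F> there")

print(cleaned)
print(literal_cleaned)
```

Note that the cloud itself (U+2601) is kept; only the selector after it is dropped. In a tm pipeline, a function like strip_chars would normally be wrapped in content_transformer() before being handed to tm_map().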

Darren Cook