I need to do topic modeling in R on pieces of text that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but there is a problem with the results.

A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.

Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart". The function replace_emoji_identifier() doesn't help either, as the identifiers make the analysis hard.

Dummy data set, reproducible via dput() (including the step that forces everything to lowercase):

Emoji_struct <- c(
      list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),  
      list(content = "😍😍", "😊 thanks for helping",  "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)

Current code (data_orig is a list of several files):

library(textclean)
library(tm) # provides stripWhitespace(); the other calls are base R or textclean

#pre-processing:
data <- gsub("'", "", data) 
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data)  #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data) 
data <- gsub("[[:digit:]]", "", data)  #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)

Desired output:

[1] list(content = c("fire fire wow", 
                     "facewithopenmouth look at that", 
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose", 
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"), 
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes", 
                     "smilingfacewithsmilingeyes thanks for helping", 
                     "cryingface oh no, why? cryingface", 
                     "careful, challenging crossmark crossmark crossmark"))

Any ideas? Lower cases would work, too. Best regards. Stay safe. Stay healthy.

TR_IBK21

Answer

Replace the default conversion table in replace_emoji with a version where the spaces and punctuation are removed:

library(textclean)

hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

# Emoji_struct is a list, so apply replace_emoji() to each element
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)
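
To make the effect of the gsub() call concrete, here is a minimal base R sketch. The three descriptions are typed by hand to resemble entries in column y (they are not pulled from the lexicon), so treat them as illustrative assumptions:

```r
# Hand-typed sample descriptions (assumed to resemble entries in column y)
descriptions <- c("red heart", "broken heart", "up-down arrow")

# Same pattern as above: drop every whitespace and punctuation character
collapsed <- gsub("[[:space:]]|[[:punct:]]", "", descriptions)
print(collapsed)

# Tokenising on whitespace: the original descriptions share the token
# "heart", while each collapsed description stays one distinct token
tokens_before <- unlist(strsplit(descriptions, "[[:space:]]+"))
tokens_after  <- unlist(strsplit(collapsed, "[[:space:]]+"))
print(table(tokens_before))
print(tokens_after)
```

After collapsing, "redheart" and "brokenheart" no longer contribute to a shared "heart" term in the topic model.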

Example

Single character string:

replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"

Character vector:

replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "

List:

library(magrittr) # provides %>%

list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
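
If the data is a list of character vectors (one vector per file, as in the desired output in the question) rather than a list of single strings, the same lapply() pattern should carry over, since replace_emoji() accepts a character vector. A minimal sketch, assuming textclean and lexicon are installed; docs, file_a and file_b are hypothetical stand-ins for the real data:

```r
library(textclean)

# Conversion table with spaces/punctuation stripped from the descriptions
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

# Hypothetical stand-in for the real data: one character vector per file
docs <- list(
  file_a = c("careful, challenging \u274c", "no emoji here"),
  file_b = c("\U0001F525 wow")
)

# lapply() calls replace_emoji() once per list element and returns a
# list of the same length with the same names
docs_clean <- lapply(docs, replace_emoji, emoji_dt = hash2)
str(docs_clean)
```

Because each element is processed separately, the documents stay separate and are not pasted into one long string.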

Rationale

To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):

head(lexicon::hash_emojis)
#              x                        y
#1: <e2><86><95>            up-down arrow
#2: <e2><86><99>          down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a>                    watch
#6: <e2><8c><9b>           hourglass done

This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new entries: an emoji's byte representation in x and an accompanying string in y.
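
As a rough sketch of that last point, a row could be appended along these lines. It assumes that column x stores each emoji as escaped UTF-8 bytes, as in the head() output above, which is the form base R's iconv(..., sub = "byte") produces. The emoji chosen (U+1F972) and the label "smilingfacewithtear" are purely illustrative, and whether replace_emoji() picks up the new row should be checked on real data:

```r
library(data.table)

# Work on a copy so the lexicon's own table is left untouched
hash2 <- data.table::copy(lexicon::hash_emojis)
hash2[, y := gsub("[[:space:]]|[[:punct:]]", "", y)]

# Escaped-byte form of the emoji, here "<f0><9f><a5><b2>" for U+1F972
new_x <- iconv("\U0001F972", from = "UTF-8", to = "ASCII", sub = "byte")

# Append only if the table has no entry for these bytes yet
if (!new_x %in% hash2$x) {
  old_key <- key(hash2)  # rbind() drops the key, so remember it
  hash2 <- rbind(hash2, data.table(x = new_x, y = "smilingfacewithtear"))
  if (!is.null(old_key)) setkeyv(hash2, old_key)
}

print(hash2[x == new_x])
```

The extended table is then passed to replace_emoji() through emoji_dt, exactly like hash2 in the examples above.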

slamballais
  • That works for the dummy data set. The problem is that I have a list with a mixture of text and emojis. Your code doesn't work there, or I couldn't get it to work, since it returns `Error in data[, 1] : incorrect number of dimensions`. Do you have a solution for that? – TR_IBK21 May 18 '21 at 12:11
  • Could you provide a new dummy set that better represents the actual data? – slamballais May 18 '21 at 14:00
  • There we go. This should do the trick now... Your solution turned all the words into one massive word and R wouldn't let me split it back up into words... Also I didn't quite understand how to use `lapply` in your solution. – TR_IBK21 May 18 '21 at 14:28
  • I updated the answer given the new requirements, with suggestions on how to implement new emojis as well. – slamballais May 18 '21 at 15:43
  • Thanks. :) That's what I was looking for. – TR_IBK21 May 18 '21 at 20:52