I have to do a topic modeling based on pieces of texts containing emojis with R. Using the replace_emoji()
and replace_emoticon
functions let me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have a very different meaning as can be seen with "red heart ufef" and "broken heart"
The function replace_emoji_identifier()
doesn't help either, as the identifiers make an analysis hard.
Dummy data set reproducible with by using dput()
(including the step force to lowercase
:
Emoji_struct <- c(
list(content = " wow", " look at that", "this makes me angry", "❤\ufe0f, i love it!"),
list(content = "", " thanks for helping", " oh no, why? ", "careful, challenging ❌❌❌")
)
Current coding (data_orig
is a list of several files):
library(textclean)
#The rest should be standard r packages for pre-processing
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower cases would work, too. Best regards. Stay safe. Stay healthy.