I ran into a weird case in my data frame while working with emojis in R. I want to delete all emojis before running a sentiment analysis. When I do this, I get some cases where the string should be empty but isn't. What is the problem? I would like to replace empty fields with NA. Here is a small example:

library(tidyverse)

df <- data.frame(x = c("test","♥️♥️♥"))

nchar(df$x[2])

df_new <- df |>
  mutate(x = str_remove_all(x, "[[:emoji:]]"))

is_empty(df_new$x[2])

Now I would like to use the following command, but this doesn't work, because the string is not empty.

tmp <- df_new |>
  mutate(x = na_if(x, ""))

What is the problem here, and how can I solve it?

Thank you in advance,

Aaron

2 Answers

If you run charToRaw(df_new$x[2]), you can see the bytes left in the string:

charToRaw(df_new$x[2])
# [1] ef b8 8f ef b8 8f

Those appear to be variation selectors (ef b8 8f is the UTF-8 encoding of U+FE0F, which asks for the emoji presentation of the preceding character). They are not included in the [[:emoji:]] character class, so they are not removed.
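As a quick check of that reading, here is a minimal base-R sketch that rebuilds the leftover string from the six bytes shown above and decodes it into code points. The variable names are made up for illustration.

```r
# Rebuild the leftover string from the bytes reported by charToRaw()
leftover <- rawToChar(as.raw(c(0xef, 0xb8, 0x8f, 0xef, 0xb8, 0x8f)))
Encoding(leftover) <- "UTF-8"

# Decode the UTF-8 string into Unicode code points
code_points <- utf8ToInt(leftover)
labels <- sprintf("U+%04X", code_points)
print(labels)

# Two invisible characters remain, so the string is not ""
n_chars <- nchar(leftover)
print(n_chars)
print(leftover == "")
```

Both code points come out as U+FE0F, which is why the string prints as if it were empty while na_if(x, "") leaves it alone.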

What exactly do you want to keep? It might be easier to write a regular expression for that. There are a lot of weird unicode code points out there.

You could try to remove those values in a second step with a regular expression that targets variation selectors:

df_new <- df |>
  mutate(x = str_remove_all(x, "[[:emoji:]]")) |>
  mutate(x = str_remove_all(x, "[\\u180B-\\u180D\\uFE00-\\uFE0F]|\\uDB40[\\uDD00-\\uDDEF]"))

But check for remaining content with nchar(df_new$x), not is_empty(). is_empty() tests whether an object has length zero, and an empty string is still a character vector of length one, so is_empty("") returns FALSE.
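To make the distinction concrete, here is a small base-R sketch. The vector x is made up, and length(...) == 0 stands in for what is_empty() checks. The last step is a base-R equivalent of the na_if(x, "") call from the question.

```r
x <- c("test", "")

# An empty string is still a vector of length one
has_zero_length <- length(x[2]) == 0
print(has_zero_length)

# nchar() looks at the content instead
has_no_chars <- nchar(x[2]) == 0
print(has_no_chars)

# Replace empty strings with NA
x_na <- x
x_na[nchar(x_na) == 0] <- NA
print(x_na)
```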

MrFlick

If you want to remove everything that is not a letter while still supporting any language, and your x values are already split into single words, you can simply do:

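df <- data.frame(x = c("test","♥️♥️♥"))

library(dplyr)    # for %>% and mutate()
library(stringi)

# keep only runs of Unicode letters (\p{L}); rows without any letter become NA
df %>%
  mutate(x = stri_extract_all(x, charclass = "\\p{L}"))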

     x
1 test
2   NA

If you have strings with multiple words, you can slightly adapt the code above and use this instead:

df <- data.frame(x = c("Ελλάδα means Greece ♥️ ️", "test","♥️♥️♥"))

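# process one row at a time; rowwise() avoids modifying a grouping variable inside mutate()
df %>%
  rowwise() %>%
  mutate(x = paste(stri_extract_all(x, charclass = "\\p{L}")[[1]], collapse = " ")) %>%
  ungroup()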

1 Ελλάδα means Greece
2 test           
3 NA  
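Note that paste() turns a missing value into the literal string "NA", so the last row above is text rather than a real NA. If a true NA is needed, one possible variant is sketched below. It assumes stringi is installed, uses made-up input with the emoji written as \u escapes (U+2665 heart, U+FE0F variation selector), and the helper name collapse_words is hypothetical.

```r
library(stringi)

x <- c("hello world \u2665\uFE0F", "test", "\u2665\uFE0F\u2665\uFE0F\u2665")

# One list element per input string; strings without letters give NA
words <- stri_extract_all_charclass(x, "\\p{L}")

# Hypothetical helper: join the words of one string, keeping a real NA
collapse_words <- function(w) {
  if (all(is.na(w))) NA_character_ else paste(w, collapse = " ")
}

cleaned <- vapply(words, collapse_words, character(1))
print(cleaned)
print(is.na(cleaned))
```

Because the result is an ordinary character vector, it can be used directly inside mutate() without grouping or rowwise().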
Merijn van Tilborg