I am working on a project on sentiment analysis of emojis. I only want tweets that contain emojis, and I don't want to filter them manually. Is there a way to change the code below so that it returns only tweets that contain emojis? For example, if I scrape 100 tweets, all 100 should contain at least one emoji along with some text. Any help will be highly appreciated.

For example, I only want tweets like this:

when is @McDonalds_SA gonna let us add spicy sauce on our veg burgers when we order on MrD or Uber eats 

Code:

get_token() # Connects with Twitter API
Uber <- search_tweets("uber", n = 100, lang = "en")

2 Answers

Note: I assume you're not looking for all emoji, since the Unicode Emoji property also covers quite common characters such as #, * and the digits 0-9 (from https://unicode.org/Public/UNIDATA/emoji/emoji-data.txt).

Unicode library

To get the Unicode block for one or more characters, we can use the Unicode library:

library("Unicode") # install.packages("Unicode")

A few examples:

> "😎" |> utf8ToInt() |> u_char_properties("Block")

            Block
U+1F60E Emoticons
> "‍‍" |> utf8ToInt() |> u_char_properties("Block")

                                  Block
1 Miscellaneous Symbols and Pictographs
2                   General Punctuation
3 Miscellaneous Symbols and Pictographs
4                   General Punctuation
5 Miscellaneous Symbols and Pictographs
> "🤐" |> utf8ToInt() |> u_char_properties("Block")

                                       Block
U+1F910 Supplemental Symbols and Pictographs
> "✅" |> utf8ToInt() |> u_char_properties("Block")

          Block
U+2705 Dingbats
> "☝️" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+261D Miscellaneous Symbols
U+FE0F   Variation Selectors
> "☎️" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+260E Miscellaneous Symbols
U+FE0F   Variation Selectors
> "♍" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+264D Miscellaneous Symbols
> "🫃🏽" |> utf8ToInt() |> u_char_properties("Block")

                                        Block
U+1FAC3    Symbols and Pictographs Extended-A
U+1F3FD Miscellaneous Symbols and Pictographs
> "🚂" |> utf8ToInt() |> u_char_properties("Block")

                            Block
U+1F682 Transport and Map Symbols

Matching all emoji-like characters could be done like this:

blocks <- c("Emoticons",
            "Miscellaneous Symbols and Pictographs",
            "Supplemental Symbols and Pictographs",
            "Dingbats",
            "Miscellaneous Symbols",
            "Symbols and Pictographs Extended-A",
            "Transport and Map Symbols")
> "😎" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "‍‍" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "🤐" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "☎️" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "♍" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "🫃🏽" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "🚂" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE
> "#" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] FALSE
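
The check above can be wrapped in a vectorized helper. The following is a minimal sketch that avoids the Unicode package by hard-coding the code point ranges of the same seven blocks; `emoji_ranges`, `has_emoji` and the sample `tweets` data frame are hypothetical names introduced here for illustration, not part of any package.

```r
# Code point ranges of the seven blocks listed above (assumption: hard-coded
# here instead of being looked up with u_char_properties())
emoji_ranges <- rbind(
  c(0x2600,  0x26FF),   # Miscellaneous Symbols
  c(0x2700,  0x27BF),   # Dingbats
  c(0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
  c(0x1F600, 0x1F64F),  # Emoticons
  c(0x1F680, 0x1F6FF),  # Transport and Map Symbols
  c(0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
  c(0x1FA70, 0x1FAFF)   # Symbols and Pictographs Extended-A
)

# Hypothetical helper: TRUE for each string that contains at least one
# code point inside one of the ranges; NA and empty strings give FALSE
has_emoji <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    in_range <- vapply(
      cp,
      function(p) any(p >= emoji_ranges[, 1] & p <= emoji_ranges[, 2]),
      logical(1)
    )
    any(in_range, na.rm = TRUE)
  }, logical(1), USE.NAMES = FALSE)
}

# Made-up sample data standing in for the result of search_tweets()
tweets <- data.frame(
  text = c("great ride \U0001F60E", "no emoji here #uber", "done \u2705"),
  stringsAsFactors = FALSE
)
tweets[has_emoji(tweets$text), , drop = FALSE]
```

Because the helper returns a plain logical vector, it could also be used inside `dplyr::filter()` without `rowwise()`. The trade-off is that the ranges have to be updated by hand if more blocks are needed.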

Integrating into your code

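library(rtweet)
library(dplyr)
library(Unicode) # provides u_char_properties()

get_token() # Connects with Twitter API
Uber <- search_tweets("uber", n = 100, lang = "en")

# Keep only tweets with at least one character in an emoji-like block;
# `blocks` is the vector defined above
Uber_filtered <- Uber %>%
  rowwise() %>%
  filter(text |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0) %>%
  ungroup()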
Caspar V.

Just a simple solution based on a regex of all emojis. Let me know if this works.

library(rtweet)
library(dplyr)
library(stringr)

get_token()
uber <- search_tweets("uber", n = 2000, lang = "en")
emoji_regex <- "||||☺️||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||✨||||||||||||||||||||||✊|✌️||✋|✋||||||||☝️|||||||||||||||||||||||||||||||||||||||||||||||||||||||❤️|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||⭐|☀️|⛅|☁️|⚡|☔|❄️|⛄||||||||||||||||||||||||||||||||||||☎️|☎️|||||||||||||||⏳|⌛|⏰|⌚||||||||||||||||||||||||||||||||||||||||✉️|✉️|||||||||||||||||||||||||✂️|||✒️|✏️|||||||||||||||||||||||||||||||||||||||⚽|⚾️|||||⛳|||||||||||☕||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||⛪|||||||⛺||||||||||||⛲|||⛵|⛵|||⚓||✈️||||||||||||||||||||||||||||||||||||||||||⚠️|||⛽||||♨️|||||||||||||||||1️⃣|2️⃣|3️⃣|4️⃣|5️⃣|6️⃣|7️⃣|8️⃣|9️⃣|0️⃣|||#️⃣||⬆️|⬇️|⬅️|➡️||||↗️|↖️|↘️|↙️|↔️|↕️||◀️|▶️|||↩️|↪️|ℹ️|⏪|⏩|⏫|⏬|⤵️|⤴️||||||||||||||||||||||||||||||️|♿||️||️|Ⓜ️||||||㊙️|㊗️||||||||||||⛔|✳️|❇️|❎|✅|✴️|||||️|️||️||➿|♻️|♈|♉|♊|♋|♌|♍|♎|♏|♐|♑|♒|♓|⛎||||||©️|®️|™️|❌|‼️|⁉️|❗|❗|❓|❕|❔|⭕|||||||||||||||||||||||||||||||✖️|➕|➖|➗|♠️|♥️|♣️|♦️|||✔️|☑️|||➰|〰️|〽️||◼️|◻️|◾|◽|▪️|▫️||||⚫|⚪||||⬜|⬛||||"
filter(uber, str_detect(text, emoji_regex))
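
An alternative to enumerating every emoji in the pattern is a character class over code point ranges. The following is a minimal sketch in base R, assuming that matching a handful of emoji-heavy Unicode blocks is good enough; `emoji_class` and the sample `texts` vector are hypothetical names, and the class also matches non-emoji symbols and unassigned code points that fall inside those blocks.

```r
# Character class built from code point ranges instead of a list of emoji.
# The \u and \U escapes make the pattern a UTF-8 string.
emoji_class <- paste0(
  "[",
  "\u2600-\u27BF",          # Miscellaneous Symbols, Dingbats
  "\U0001F300-\U0001F64F",  # Misc. Symbols and Pictographs, Emoticons
  "\U0001F680-\U0001F6FF",  # Transport and Map Symbols
  "\U0001F900-\U0001F9FF",  # Supplemental Symbols and Pictographs
  "\U0001FA70-\U0001FAFF",  # Symbols and Pictographs Extended-A
  "]"
)

# Made-up sample texts standing in for the tweet text column
texts <- c("uber was late \U0001F620", "uber was on time", "#uber")
grepl(emoji_class, texts, perl = TRUE)
```

The same pattern should also be usable as `str_detect(text, emoji_class)` in the filter above, although that combination is untested here.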
dcsuka