Scrape only Tweets with Emojis in R

Question

I am working on a project on sentiment analysis of emojis, so I only want tweets that contain emojis, and I don't want to pick them out manually. Is there a way to change the code below so that it returns only tweets with emojis in them? For example, if I scrape 100 tweets, all 100 should contain at least one emoji along with some text. Any help will be highly appreciated.

For example, I only want tweets like this:

when is @McDonalds_SA gonna let us add spicy sauce on our veg burgers when we order on MrD or Uber eats

Code:

library(rtweet)

get_token() # Connects with Twitter API
Uber <- search_tweets("uber", n = 100, lang = "en")

Caspar V. · Accepted Answer · 2022-07-19T20:23:59.063

Note: I assume you're not looking for all emoji, since the Unicode emoji data includes quite common characters such as #, * and the digits 0 to 9:

(from https://unicode.org/Public/UNIDATA/emoji/emoji-data.txt)

Unicode library

To get the Unicode block for one or more characters, we can use the Unicode library:

library("Unicode") # install.packages("Unicode")

A few examples:

> "😎" |> utf8ToInt() |> u_char_properties("Block")

            Block
U+1F60E Emoticons

> "‍‍" |> utf8ToInt() |> u_char_properties("Block")

                                  Block
1 Miscellaneous Symbols and Pictographs
2                   General Punctuation
3 Miscellaneous Symbols and Pictographs
4                   General Punctuation
5 Miscellaneous Symbols and Pictographs

> "🤐" |> utf8ToInt() |> u_char_properties("Block")

                                       Block
U+1F910 Supplemental Symbols and Pictographs

> "✅" |> utf8ToInt() |> u_char_properties("Block")

          Block
U+2705 Dingbats

> "☝️" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+261D Miscellaneous Symbols
U+FE0F   Variation Selectors

> "☎️" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+260E Miscellaneous Symbols
U+FE0F   Variation Selectors

> "♍" |> utf8ToInt() |> u_char_properties("Block")

                       Block
U+264D Miscellaneous Symbols

> "🫃🏽" |> utf8ToInt() |> u_char_properties("Block")

                                        Block
U+1FAC3    Symbols and Pictographs Extended-A
U+1F3FD Miscellaneous Symbols and Pictographs

> "🚂" |> utf8ToInt() |> u_char_properties("Block")

                            Block
U+1F682 Transport and Map Symbols

Matching all emoji-like characters could be done like this:

blocks <- c("Emoticons",
            "Miscellaneous Symbols and Pictographs",
            "Supplemental Symbols and Pictographs",
            "Dingbats",
            "Miscellaneous Symbols",
            "Symbols and Pictographs Extended-A",
            "Transport and Map Symbols")

> "😎" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "‍‍" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "🤐" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "☎️" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "♍" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "🫃🏽" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "🚂" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] TRUE

> "#" |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0
[1] FALSE
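
If installing the Unicode package is not an option, roughly the same block test can be done in base R by comparing code points against the block ranges directly. The following is only a sketch: emoji_ranges and has_emoji() are hypothetical names (not part of any package), and the ranges are the code point ranges of the seven blocks in the blocks vector above, written out by hand.

```r
# Code point ranges of the seven Unicode blocks listed in `blocks`
# (assumption: written out by hand from the Unicode block definitions).
emoji_ranges <- rbind(
  c(0x2600,  0x26FF),   # Miscellaneous Symbols
  c(0x2700,  0x27BF),   # Dingbats
  c(0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
  c(0x1F600, 0x1F64F),  # Emoticons
  c(0x1F680, 0x1F6FF),  # Transport and Map Symbols
  c(0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
  c(0x1FA70, 0x1FAFF)   # Symbols and Pictographs Extended-A
)

# Hypothetical helper: for each string in x, TRUE if at least one of its
# code points falls inside one of the ranges above.
has_emoji <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))  # code points of one string
    in_range <- vapply(
      cp,
      function(p) any(p >= emoji_ranges[, 1] & p <= emoji_ranges[, 2]),
      logical(1)
    )
    any(in_range)
  }, logical(1), USE.NAMES = FALSE)
}

# Sample strings built with intToUtf8() so the script does not depend
# on how the source file is encoded.
samples <- c(
  paste("uber was late again", intToUtf8(0x1F60E)),  # U+1F60E, Emoticons
  "plain text only",
  "#uber",
  paste("done", intToUtf8(0x2705))                   # U+2705, Dingbats
)
print(has_emoji(samples))
# TRUE FALSE FALSE TRUE
```

Because has_emoji() is vectorised over its input, it could be used on a whole text column at once, for example Uber[has_emoji(Uber$text), ], without rowwise().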

Integrating into your code

library(rtweet)
library(dplyr)
library(Unicode) # provides u_char_properties()

get_token() # Connects with Twitter API
Uber <- search_tweets("uber", n = 100, lang = "en")

# `blocks` is the character vector of block names defined above
Uber_filtered <- Uber %>%
  rowwise() %>%
  filter(text |> utf8ToInt() |> u_char_properties("Block") |> unlist() |> intersect(blocks) |> length() > 0) %>%
  ungroup()

There is an error in the filter part where it says: object 'blocks' not found. — M. Talha Bin Asif, Jul 19 '22 at 09:44
@M.TalhaBinAsif apologies, I forgot to copy that vector to SO. I've added it now. — Caspar V., Jul 19 '22 at 20:24

dcsuka · Answer 2 · 2022-07-19T01:04:41.680

Just a simple solution based on a regex of all emojis. Let me know if this works.

library(rtweet)
library(dplyr)
library(stringr)

get_token()
uber <- search_tweets("uber", n = 2000, lang = "en")
emoji_regex <- "||||☺️||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||✨||||||||||||||||||||||✊|✌️||✋|✋||||||||☝️|||||||||||||||||||||||||||||||||||||||||||||||||||||||❤️|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||⭐|☀️|⛅|☁️|⚡|☔|❄️|⛄||||||||||||||||||||||||||||||||||||☎️|☎️|||||||||||||||⏳|⌛|⏰|⌚||||||||||||||||||||||||||||||||||||||||✉️|✉️|||||||||||||||||||||||||✂️|||✒️|✏️|||||||||||||||||||||||||||||||||||||||⚽|⚾️|||||⛳|||||||||||☕||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||⛪|||||||⛺||||||||||||⛲|||⛵|⛵|||⚓||✈️||||||||||||||||||||||||||||||||||||||||||⚠️|||⛽||||♨️|||||||||||||||||1️⃣|2️⃣|3️⃣|4️⃣|5️⃣|6️⃣|7️⃣|8️⃣|9️⃣|0️⃣|||#️⃣||⬆️|⬇️|⬅️|➡️||||↗️|↖️|↘️|↙️|↔️|↕️||◀️|▶️|||↩️|↪️|ℹ️|⏪|⏩|⏫|⏬|⤵️|⤴️||||||||||||||||||||||||||||||️|♿||️||️|Ⓜ️||||||㊙️|㊗️||||||||||||⛔|✳️|❇️|❎|✅|✴️|||||️|️||️||➿|♻️|♈|♉|♊|♋|♌|♍|♎|♏|♐|♑|♒|♓|⛎||||||©️|®️|™️|❌|‼️|⁉️|❗|❗|❓|❕|❔|⭕|||||||||||||||||||||||||||||||✖️|➕|➖|➗|♠️|♥️|♣️|♦️|||✔️|☑️|||➰|〰️|〽️||◼️|◻️|◾|◽|▪️|▫️||||⚫|⚪||||⬜|⬛||||"
filter(uber, str_detect(text, emoji_regex))
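
A long alternation like emoji_regex is hard to maintain and depends on every emoji surviving copy and paste intact. A more compact variant, sketched here under the assumption that matching by code point range is acceptable in place of an explicit list, uses a single character class. The name emoji_class and the choice of ranges are assumptions for illustration; the ranges cover the Miscellaneous Symbols, Dingbats, Emoticons, pictograph and transport blocks, so they will not match exactly the same set as the list above.

```r
# Character class built from code point ranges (assumed ranges, chosen
# for illustration; adjacent blocks are merged into one range).
emoji_class <- paste0(
  "[",
  "\u2600-\u27BF",          # Miscellaneous Symbols, Dingbats
  "\U0001F300-\U0001F64F",  # Misc Symbols and Pictographs, Emoticons
  "\U0001F680-\U0001F6FF",  # Transport and Map Symbols
  "\U0001F900-\U0001F9FF",  # Supplemental Symbols and Pictographs
  "\U0001FA70-\U0001FAFF",  # Symbols and Pictographs Extended-A
  "]"
)

# Sample texts built with intToUtf8() to avoid source-encoding problems.
texts <- c(
  paste("great ride", intToUtf8(0x1F682)),  # contains U+1F682
  "no emoji here, just #uber",
  paste("ok", intToUtf8(0x2705))            # contains U+2705
)

# perl = TRUE uses PCRE, which handles the UTF-8 ranges in the class.
hits <- grepl(emoji_class, texts, perl = TRUE)
print(hits)
# TRUE FALSE TRUE
```

The same pattern string should also be usable with stringr, as in filter(uber, str_detect(text, emoji_class)), though that has not been checked against real tweet data here.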

Hey, thanks for replying but I am only looking to get only the tweets with emojis. Like all the tweets must have emojis in them. I tried your code but it is returning all kinds of tweets, with plain text and tweets with emojis. — M. Talha Bin Asif, Jul 19 '22 at 00:33
It gives this error after "emj <- emoji(list_emoji(), TRUE)" this line. Error:emj <- emoji(list_emoji(), TRUE). — M. Talha Bin Asif, Jul 19 '22 at 00:51
Try following the remoji install instructions here: https://stackoverflow.com/questions/43359066/how-can-i-match-emoji-with-an-r-regex — dcsuka, Jul 19 '22 at 00:53
