0

Original Post

I am trying to gather some tweets using the {rtweet} package and store them in a MySQL database. But I am getting errors whenever I try to upload a tweet that contains emojis.

Here is my code:

# loading packages
library(rtweet)
library(DBI)
library(RMariaDB)
library(dplyr)
library(dbplyr)
library(lubridate)

# create twitter api token
twitterToken <- rtweet_bot(
  api_key = "*****",
  api_secret = "*****",
  access_token = "*****",
  access_secret = "*****"
)

# search tweets
tweets <- search_tweets(q = "beautiful", n = 500, type = "recent", include_rts = FALSE, token = twitterToken)
tweets$Topic <- "beautiful"
tweets$created_at  <- parse_date_time(tweets$created_at, orders = c("%a %b %d %T %z %Y", "%Y-%m-%d %H:%M:%S"))
tweets$screen_name <- users_data(tweets)$screen_name
tweets$status_url <- paste0("https://twitter.com/", tweets$screen_name, "/status/", tweets$id_str)
tweets <- tweets %>% select(Topic, status_url, screen_name, created_at, text, favorite_count, retweet_count)

# upload to database
con <- dbConnect(MariaDB(), dbname="dbname", username="username", password="password", host="db_host", port=db_port, ssl.ca = "ssl.pem", load_data_local_infile = TRUE)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)

This throws the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"MariaDBConnection", "character", "tweets"’

This is a bit odd because it used to work before I upgraded my R version and updated all the packages. But is not the main issue. I can work around this by the following codes:

tweets <- as.data.frame(tweets)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)

This time I get the following error:

Error: Error executing query: Invalid utf8 character string: '@Alyri_tv So beautifull girl '

The string it complains about is the first tweet that contains emojis. It works perfectly fine if I only select tweets that don't have any emojis in them. It works even if the tweets contain Chinese, Korean and other language characters. It is the emojis that are causing the problem. The default collation for the database is utf8mb4_unicode_ci

Update 1

Here is the output from dput(tweets[1:10,])

structure(list(Topic = c("beautiful", "beautiful", "beautiful", 
"beautiful", "beautiful", "beautiful", "beautiful", "beautiful", 
"beautiful", "beautiful"), status_url = c("https://twitter.com/supyo192/status/1623573544645926917", 
"https://twitter.com/Ghana_Ronaldo/status/1623573542267756545", 
"https://twitter.com/ValentinoAndy77/status/1623573539906281479", 
"https://twitter.com/Gino15618462/status/1623573536550838273", 
"https://twitter.com/kaylenicodemus/status/1623573533468000256", 
"https://twitter.com/beace_tw/status/1623573527688347651", "https://twitter.com/JoyceNwanochi/status/1623573525968691200", 
"https://twitter.com/Adasu_d_gr8/status/1623573525742120961", 
"https://twitter.com/AsadAli62560407/status/1623573523070390276", 
"https://twitter.com/HidingWolfe/status/1623573516313407488"), 
    screen_name = c("supyo192", "Ghana_Ronaldo", "ValentinoAndy77", 
    "Gino15618462", "kaylenicodemus", "beace_tw", "JoyceNwanochi", 
    "Adasu_d_gr8", "AsadAli62560407", "HidingWolfe"), created_at = structure(c(1675946672, 
    1675946671, 1675946671, 1675946670, 1675946669, 1675946668, 
    1675946667, 1675946667, 1675946667, 1675946665), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC"), text = c("@meeshell___ THESE ARE BEAUTIFUL!! ❤\nLove! Love! &amp; Good Vibes to you!! ", 
    "Good morning beautiful and handsome people , pls how are u doing today ?? ", 
    "@geegeebanks Good morning have a beautiful and happy Thursday ❤   ", 
    "@hossain_mita \"I learned that in feelings we are guided by mysterious laws, perhaps fate or perhaps a mirage, in any case something inexplicable. Because... after all.. there is never a reason why you fall in love. It happens and Enough.\" Good morning dear beautiful soul", 
    "hings are beautiful if you love them\n\n&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;S50&lt;&lt;&lt;&lt;&lt;&lt;&lt;\n&gt;&gt;&gt;&gt;&gt;&gt;S50&lt;&lt;&lt;&lt;&lt;&lt;&lt;\n&gt;&gt;&gt;&gt;&gt;&gt;S50&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;", 
    "子育てママはなぜ燃える!? #ナイトブラ", 
    "@isaaczara_ This looks beautiful! \nWhole I'm not a designer, I've been doing some work in Canva these past months, and recently stumbled upon the \"Intro rust\" font. I love how sleek it it!\n\nMaybe some day it can feature in your series.\n\nThese are equally beautiful too.", 
    "Because she is extremely beautiful, gorgeous and charming does not mean you should over look her .\n\nThose red flags will swallow up everything you are seeing.", 
    "@farhat121212 Beautiful picture", "@MistressSnowPhD Now that is some beautiful ink. I’m sure he’d love it."
    ), favorite_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L), retweet_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L)), row.names = c(NA, 10L), class = "data.frame")

If I try dbWriteTable(con, "testTweetsDB", tweets[c(5,6,7,9,10),], overwrite = T), it works fine.

If I try dbWriteTable(con, "testTweetsDB", tweets[-c(5,6,7,9,10),], overwrite = T), gives me the error Error: Error executing query: Invalid utf8 character string: '@meeshell___ THESE ARE BEAUTIFUL!! '

Also, the table testTweetsDB does not exist on the database. I am relying on the dbWriteTable() function to create the table.

  • What OS are you on? Are you using Windows or something else? What encoding is used by your local `tweets` data. What does `names(as.data.frame(tweets))` return? I'm not sure what column the tweets are in but you can check the encoding with `Encoding(as.data.frame(tweets)$tweetcolumnname)` once you figure it out. – MrFlick Feb 08 '23 at 15:01
  • I am on Windows 11. `names(tweets)` returns `[1] "Topic" "status_url" "screen_name" "created_at" "text" "favorite_count" "retweet_count"` The column named `text` holds the tweets. And when I execute `Encoding(tweets$text)` it returns a vector that consists of two values either `UTF-8` or `unknown`. Tweets that do not have any emojis returns `unknown` and tweets that have emojis returns `UTF-8` – Wahiduzzaman Khan Feb 08 '23 at 18:38
  • So does the `testTweetsDB` table already exist? can you run `SHOW CREATE TABLE testTweetsDB` in your database to see how the columns are created? it sounds like the database column may be of the wrong type. – MrFlick Feb 08 '23 at 19:51
  • Also it seems you may have crossposted this issue to the github repo: https://github.com/r-dbi/RMariaDB/issues/295. If you do that it's better to let people know so there's not duplicated effort solving the same problem. It would be a lot easier if you procivided some sort of reproducible example. We don't need any of the `rtweet` code. Just provide a sample data.frame with a "good" row and a "bad" row in a `dput()` format so we can actually run and test the code ourselves. – MrFlick Feb 08 '23 at 19:53
  • Maybe also run `SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;` to see what the default character set it (the collation seems less relevant) – MrFlick Feb 08 '23 at 19:57
  • @MrFlick I have updated my post. It now has a reproducible example. Also from the output of `dbGetQuery(con, "SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;")` I can confirm that the `DEFAULT_CHARACTER_SET_NAME` is `utf8mb4` – Wahiduzzaman Khan Feb 09 '23 at 07:51

0 Answers0