I have a data frame of Twitter handles that I want to scrape on a regular basis.

df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))

My Methodology

I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:

1) By using the rtweet library, I would like to gather tweets using the search_tweets function.

2) Then I would like to merge the new tweets with the existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.

3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example, for the dataframe of tweets obtained using the handle @BarackObama, I'd like an additional column called Source containing the handle @BarackObama.

4) In the event that the API returns 0 tweets, I would like Step 2) to be ignored. Very often, when the API returns 0 tweets, I get an error as it attempts to merge an empty dataframe with an existing one.

5) Finally, I would like to save the results of each scrape to a separate dataframe object. The name of each dataframe object would be its Twitter handle, in lower case and without the @.

My Desired Output

My desired output would be 4 dataframes, katyperry, justinbieber, cristiano & barackobama.

My Attempt

library(rtweet)
library(ROAuth)

#Accessing Twitter API using my Twitter credentials

key <-"yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <-"78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key,secret)

#Dataframe of Twitter handles    
df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))

# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query,","))
tweets.dataframe = list()

# Loop through the twitter handles & store the results as individual dataframes
for(i in 1:length(query)){
  result<-search_tweets(query[i],n=10000,include_rts = FALSE)
  #Strip tweets that  contain RTs
  tweets.dataframe <- c(tweets.dataframe,result)
  tweets.dataframe <- unique(tweets.dataframe)
}

However, I have not been able to figure out how to make my for loop skip the concatenation step when the API returns 0 tweets for a given handle.

Also, my for loop does not create 4 dataframes in my environment; it stores the results as a Large list.

I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.

Your inputs would be greatly appreciated.

Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.

Varun
  • What is the data type of result when it doesn't contain any tweets? If it's a data.frame, then `if(nrow(result) == 0) next` could help. By LargeList do you mean https://cran.r-project.org/web/packages/largeList/largeList.pdf? That is unlikely, as the package does not use it: https://cran.r-project.org/web/packages/rtweet/index.html – abhiieor Apr 24 '18 at 08:25
  • the datatype is an empty dataframe. Yes it seems like a Large List, but I am sure that my for loop needs some kind of modification for the results to make sense. – Varun Apr 24 '18 at 08:28

1 Answer

tweets.dataframe = list()

# Loop through the twitter handles & store the results as individual dataframes
for(i in seq_along(query)){
  result<-search_tweets(query[i],n=10,include_rts = FALSE)

  if (nrow(result) > 0) {  # only if result has data
    tweets.dataframe <- c(tweets.dataframe, list(result))
  }
}

# tweets.dataframe is now a list where each element is a data frame containing
# the results from an individual query; for example...

tweets.dataframe[[1]]

# to combine them into one data frame

do.call(rbind, tweets.dataframe)
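
A variation on the loop above, as a minimal sketch: keying the list by the cleaned handle makes each result retrievable by name, and `next` skips handles that return nothing. `fetch_tweets` is a hypothetical stand-in for `search_tweets` so the example runs without API credentials; the column names it returns are made up for illustration and are not what rtweet returns.

```r
# Hypothetical stand-in for rtweet::search_tweets(), so the sketch runs offline.
# It returns zero rows for one handle to mimic an empty API result.
fetch_tweets <- function(handle, n = 10) {
  if (handle == "@justinbieber") {
    return(data.frame(status_id = character(0), text = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(status_id = paste0(handle, "_", seq_len(2)),
             text = paste("tweet mentioning", handle, seq_len(2)),
             stringsAsFactors = FALSE)
}

twitter_handles <- c("@katyperry", "@justinbieber", "@Cristiano", "@BarackObama")
tweets_by_handle <- list()

for (handle in twitter_handles) {
  result <- fetch_tweets(handle)
  if (nrow(result) == 0) next            # nothing returned: skip this handle

  result$Source <- handle                # record which query produced the rows
  key <- tolower(sub("^@", "", handle))  # "@BarackObama" -> "barackobama"
  tweets_by_handle[[key]] <- result
}

# One data frame per handle, accessible by name
names(tweets_by_handle)
head(tweets_by_handle[["barackobama"]])

# Or everything combined, with Source identifying the handle
all_tweets <- do.call(rbind, tweets_by_handle)
```

With the real API, replacing `fetch_tweets(handle)` by `search_tweets(handle, n = 10, include_rts = FALSE)` should give the same structure, and a named list avoids creating variables with `assign()`.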

In response to a comment...

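twitter_handles <- c("@katyperry","@justinbieber","@Cristiano","@BarackObama")

# Loop through the twitter handles & store the results as individual dataframes
for(handle in twitter_handles) {
  result <- search_tweets(handle, n = 15 , include_rts = FALSE)

  # skip this handle if the API returned no tweets
  if(nrow(result) == 0) next

  result$Source <- handle

  # name of the data frame: the handle in lower case, without the @
  df_name <- tolower(substring(handle, 2))

  if(exists(df_name)) {
    # add the new tweets to the existing data frame and drop exact duplicate rows
    assign(df_name, unique(rbind(get(df_name), result)))
  } else {
    assign(df_name, result)
  }
}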
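
One caveat with `unique()` on whole rows: if a tweet is fetched again after one of its columns has changed (a like or retweet count, for instance), the two rows are no longer identical and both are kept. A possible alternative, sketched here under assumptions, is to deduplicate on the tweet ID only. `merge_tweets` is a hypothetical helper, and the ID column name `status_id` is an assumption; check which ID column your version of rtweet returns.

```r
# Merge newly fetched tweets into an existing data frame, one row per tweet ID.
# `id_col` is an assumption: pass the ID column your rtweet version provides.
merge_tweets <- function(existing, new, id_col = "status_id") {
  if (is.null(new) || nrow(new) == 0) return(existing)       # nothing to add
  if (is.null(existing) || nrow(existing) == 0) return(new)  # first run
  combined <- rbind(existing, new)
  # duplicated() flags the second and later occurrences of an ID, so rows
  # already stored are kept and re-fetched copies of the same tweet are dropped.
  combined[!duplicated(combined[[id_col]]), , drop = FALSE]
}

# Toy data standing in for two consecutive scrapes of the same handle
first_run  <- data.frame(status_id = c("1", "2"), favorite_count = c(10, 20),
                         stringsAsFactors = FALSE)
second_run <- data.frame(status_id = c("2", "3"), favorite_count = c(25, 5),
                         stringsAsFactors = FALSE)

barackobama <- merge_tweets(first_run, second_run)
nrow(barackobama)                            # one row per distinct tweet ID
nrow(unique(rbind(first_run, second_run)))   # row-wise unique() keeps tweet "2" twice
```

Inside the loop, the `assign(df_name, unique(rbind(...)))` line could then become `assign(df_name, merge_tweets(get(df_name), result))`. Putting `new` first in the `rbind()` would instead keep the most recently fetched copy of each tweet.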
CJ Yetman
  • Thank you. However, this `for` loop does not seem to be adding new tweets to the existing individual dataframes. I have tested this using the `min()` and `max()` functions on the `created_at` variable. Is there a way to ensure that every time this loop is rerun, only new tweets are added, and existing tweets are left intact? – Varun Apr 24 '18 at 09:43
  • Also, I have made a small edit to my question and added a Step 3 to my methodology. I'd really appreciate if you could help guide me on that. Thanks. – Varun Apr 24 '18 at 10:21
  • you do not have any "existing individual dataframes"... what/where are they? – CJ Yetman Apr 24 '18 at 10:23
  • When I said existing dataframes, I meant that for future re-runs of the loop, I'd like the new tweets to be added to the dataframe of tweets collected during the previous iteration. So each time I rerun the loop and gather the latest tweets, the number of rows in my dataframe should be greater than the number of rows in my dataframe before the loop was rerun. – Varun Apr 24 '18 at 10:36
  • added a new solution that creates new or adds to existing data frames – CJ Yetman Apr 24 '18 at 10:58