1

I want to use a foreach each loop running with a doParallel backend for getting tweets from a MySQL database with the RMySql package.

I create a connection to the database for every user id I want to query, then I get every tweet from that user by 200 batches. If the batch size is 0 (so there are no further tweets) I query next user id.

I want to store the information in a dataframe called tweets, which has columns for the number of hashtags in a tweet and a column with dates. For every tweet I want to find out how many hashtags it has and in which month it was created. Then I want to increase the number in the dataframe by 1.

So how can I write the results for every tweet in the dataframe?

My dataframe in the beginning:

| dates    | zero_ht | one_ht | two_ht | three_ht | four_ht | five_ht |
|----------|---------|--------|--------|----------|---------|---------|
| 01/01/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/02/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/03/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/04/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/05/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/06/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/07/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/08/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/09/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/10/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/11/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/12/13 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/01/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/02/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/03/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/04/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/05/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/06/14 | 0       | 0      | 0      | 0        | 0       | 0       |
| 01/07/14 | 0       | 0      | 0      | 0        | 0       | 0       |

My code:

x<- foreach(i=1:nrow(ids) ,.packages=c("DBI", "RMySQL"),.combine=rbind ) %dopar% {

con <- dbConnect(MySQL(), *CREDENTIALS*)

start <- 0

length <- 1
while(length > 0)
{
query <- *QUERY*
data <- dbGetQuery(con, query)

length <- nrow(data)

#print(paste("Starting at ",start,sep=""))

for(j in 1:length)
{   
    if(length==0)
    {

    }
    else{ 

    #get the number of hashtags used
    number <-   nchar((gsub("[^#]","",data$message[j])))

    #get the date the tweet was created
    date <- paste(format(as.Date(data$created_at[j]), "%Y-%m"),"-01",sep="")
    # just use it when there are less than 5 hashtags
    if(number < 5)
    {

        if(number==0)
        {


        tweets[tweets$dates==date,2] <- tweets[tweets$dates==date,2]+1


        }
        else{
            tweets[tweets$dates==date,number+1] <- tweets[tweets$dates==date,number+1]+1


        }

    }

}    
}
#increase the start by 200; to get the next 200 tweets
start <- start + 200

}
data.frame(date=date,number=number)
dbDisconnect(con) 
}
JulianHi
  • 286
  • 2
  • 4
  • 14
  • Wait - so does this code not currently work properly? If not, could you please add the error messages and/or describe what is not working correctly. Since you are using a data source (personal database) that other people presumably cannot access, you will most likely have to provide a lot of detail about your problem. – nrussell Jul 22 '14 at 14:25
  • Hey thanks for your comment. Actually there is no error message, but the tweet dataframe stays filled with 0´s. And the variable X just contains a list with "true" – JulianHi Jul 22 '14 at 14:26
  • Hmm to be honest I've never used RMySql or MySQL (I use SQL Server and RODBC), but regardless I've never tried to fetch SQL data in parallel because my gut feeling has always been that there would be some backend complications with this that would not arise in a non-parallel counterpart scenario. I did see [this similar post](http://stackoverflow.com/questions/9931802/parallelizing-sql-queries-in-r) that might provide you with some useful information though. – nrussell Jul 22 '14 at 14:44
  • Generally speaking though, I would say that unless you are running such a complex computation / process that parallelization is absolutely critical, you should stick with a non-parallel equivalent. Also, you should try to run this same code not in parallel (e.g. `%do$ instead of %dopar%`) and what happens - this way you will most likely be able to tell whether the issue is with your code or just the fact that you are trying to execute it in parallel. – nrussell Jul 22 '14 at 14:53
  • The code works with a normal for-loop and with a foreach loop using %do%. But it consists of a lot of SQL queries what makes it timeconsuming. So using this code in a not parallel way works, but is extremely slow – JulianHi Jul 22 '14 at 14:55
  • Hey, I don't really use `RMySQL` much, but is the problem with the return value of your function? Can you try putting `data.frame(date=date,number=number)` on the *last* line of the function, i.e. *after* `dbDisconnect(con)`. – konvas Jul 22 '14 at 19:31

1 Answers1

0

Thanks to the comments I could solve the problem: The reason for the list with just "TRUE"s in it, was that the last command in the foreach loop was

dbDisconnect(con) 

And when the database connection was closed successfully it returns a "TRUE".

So I just had to swap the last two lines and make

data.frame(date=date,number=number)

and everything worked fine.

Regards

JulianHi
  • 286
  • 2
  • 4
  • 14