
I have a matrix with many millions of values. One column is a weirdly formatted date, which I am converting to an actual datetime that I can sort on.

I want to speed this up by doing it in parallel. I've had success doing minor things in parallel before, but those were easy because I wasn't actively changing an existing matrix.

How do I do this in parallel? I can't seem to figure it out...

The code I want to parallelize is...

len = dim(combinedDF)[1]
for (j in 1:len)
{
    sendTime = combinedDF[j, "tweetSendTime"]
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
    combinedDF[j, "tweetSendTime"] = toString(updatedTime)
}

EDIT: I was told to also try apply. I tried...

len = dim(combinedDF)[1]
### Using apply
apply(combinedDF, 1, function(combinedDF, y) {
    sendTime = combinedDF[y, "tweetSendTime"]
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
    combinedDF[y, "tweetSendTime"] = toString(updatedTime)
    combinedDF[y, ]
}, y = 1:len)

However, that throws an error when the closing }, is processed: Error in combinedDF[y, "tweetSendTime"] : incorrect number of dimensions.

EDIT:

updateTime = function(timeList) {
    sendTime = timeList
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
    toString(updatedTime)
}


apply(as.matrix(combinedDF[,"tweetSendTime"]),1,updateTime)

This seems to work.
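For what it's worth, the as.matrix conversion can probably be dropped: sapply will run updateTime over the column directly (a sketch, assuming tweetSendTime is stored as a character column):

combinedDF[, "tweetSendTime"] <- sapply(combinedDF[, "tweetSendTime"], updateTime,
                                        USE.NAMES = FALSE)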

Jibril

2 Answers


Since you're just modifying a single column of combinedDF, and gsub and strptime are vectorized functions, you don't need a loop or any kind of "apply" function:

sendTime <- gsub(" 0000", " +0000", combinedDF[, "tweetSendTime"])
updatedTime <- strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
combinedDF[, "tweetSendTime"] <- as.character(updatedTime)

Note that I used as.character since it is vectorized, while toString is not: it collapses the whole vector into a single string.
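To see the difference, a quick illustration with made-up timestamps:

times <- strptime(c("Mon Mar 21 10:00:00 +0000 2016",
                    "Tue Mar 22 11:30:00 +0000 2016"),
                  "%a %b %d %H:%M:%S %z %Y")
as.character(times)  # character vector of length 2, one entry per element
toString(times)      # one single comma-separated string for the whole vector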

Steve Weston

I usually use doParallel for parallel execution:

library(doParallel)
ClusterCount = 2  # depends on the number of threads you want to use
cl <- makeCluster(ClusterCount)
registerDoParallel(cl)
len = dim(combinedDF)[1]
# Each iteration returns its modified row; .combine = rbind stitches the
# rows back together into the result
combinedDF <- foreach(j = 1:len, .combine = rbind) %dopar% {
    sendTime = combinedDF[j, "tweetSendTime"]
    sendTime = gsub(" 0000", " +0000", sendTime)
    updatedTime = strptime(sendTime, "%a %b %d %H:%M:%S %z %Y")
    combinedDF[j, "tweetSendTime"] = toString(updatedTime)
    combinedDF[j, ]
}
stopCluster(cl)

However, it should be mentioned that what you are doing does not seem to be computationally expensive; it just requires many iterations. You should consider rewriting your code: loops are not very fast in R, and a vectorized or apply()-based approach should speed up your code more than a parallel attempt. If you still want parallelism, see the chunked sketch below.
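If parallelism is still wanted, processing chunks of the column rather than one row per task keeps the scheduling overhead low. A sketch under the assumption that two workers are enough; updateChunk is an illustrative helper name, not part of the original code:

library(parallel)

cl <- makeCluster(2)
# updateChunk: illustrative helper that fixes the offset and parses a whole
# character vector in one vectorized call
updateChunk <- function(v) {
    v <- gsub(" 0000", " +0000", v)
    as.character(strptime(v, "%a %b %d %H:%M:%S %z %Y"))
}
# Split the column into two contiguous chunks, one per worker
chunks <- split(combinedDF[, "tweetSendTime"],
                cut(seq_len(nrow(combinedDF)), 2, labels = FALSE))
combinedDF[, "tweetSendTime"] <- unlist(parLapply(cl, chunks, updateChunk),
                                        use.names = FALSE)
stopCluster(cl)

Each worker receives one large vector instead of a single row per task, so the parallelization overhead is paid only twice rather than once per row.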

David Go
  • Hello! This is exactly the attempt I tried right after posting... however, it is still running. I've heard this a few times before when I've done things in R. Is it just a rule of thumb that after a certain point you should use apply over loops? I've actually never experimented with it; I guess I will have to go figure that out if that's the case. – Jibril Mar 21 '16 at 23:22
  • About your code: it would help a lot if you provided sample data so I can actually test the code; that would also let anybody reproduce your problem. Speaking of loops in R: avoid them as much as possible, as they are not considered good programming style. This is mentioned at minute 26 of this video: https://www.youtube.com/watch?v=6S9r_YbqHy8 You can also read about speeding up loops here: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r – David Go Mar 22 '16 at 11:31