I am working on a large data frame in R with 2.3 million records. Each record is a transaction of a user at a location, with a start and a stop time. My goal is to create a new data frame that contains the amount of time connected per user and per location, with one row for each hour a session covers. Let's call this "hourly connected".
Transactions can last anywhere from 8 minutes to 48 hours, so the target data frame will hold around 100 million records and will grow each month.
The code below shows how the final data frame is built, although the full code is much more complex. Running the full code takes about 9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128 GB RAM.
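library(dplyr)

# Simulated sample data: one row per session
numsessions <- 1000000
# Random start times spread over one year from 2015-01-01 (offsets in seconds)
startdate <- as.POSIXlt(runif(numsessions, 1, 365*60*60)*24, origin = "2015-1-1")
df.Sessions <- data.frame(userID     = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr      = format(startdate, "%w"),
                          start      = startdate,
                          # One random duration per session, up to 10 hours
                          end        = startdate + runif(numsessions, 1, 60*60*10)
)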
# Expand every session into one row per hour between start and end
dfhourly.connected <- df.Sessions %>%
  rowwise() %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr     = format(seq(.$start, .$end, by = 60*60), "%H")
  )
  )
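To make the expansion concrete, here is a minimal sketch of what the do() step produces for one session. The timestamps and the user ID are made up for illustration, and the time zone is fixed to UTC (an assumption, not part of the original code) so that the date and the hour are derived consistently.

```r
# Hypothetical single session, used only to illustrate the expansion
start <- as.POSIXct("2015-03-01 22:15:00", tz = "UTC")
end   <- as.POSIXct("2015-03-02 01:40:00", tz = "UTC")

# One timestamp per hour, beginning at start and never going past end
steps <- seq(start, end, by = 60 * 60)

one.session <- data.frame(
  userID     = 42,                                  # made-up user ID
  hourlydate = as.Date(steps, tz = "UTC"),          # calendar date of each hourly step
  hournr     = format(steps, "%H", tz = "UTC")      # hour of day as a two-digit string
)
print(one.session)
```

This session spans midnight, so its four hourly rows are split over two dates. A session of 48 hours would yield 49 rows in the same way, which is why the result is so much larger than the input.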
We want to parallelize this procedure over some or all of the 16 cores to speed it up. A first attempt used the multidplyr package, with the partition based on daynr:
library(multidplyr)

# Same expansion as above, but partitioned by daynr across 6 workers
df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise() %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr     = format(seq(.$start, .$end, by = 60*60), "%H")
  )
  ) %>%
  collect()
However, the rowwise function appears to require a data frame as input rather than a partition.
My questions are:
Is there a workaround to perform a rowwise calculation on partitions per core?
Can anyone suggest a different R package or method for performing this calculation?