
I am working on a large dataframe in R of 2.3 million records that contains transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user per location. Let's call this hourly connected.

Transactions can range from 8 minutes to 48 hours, so the goal dataframe will be around 100 million records and will grow each month.

The code below shows how the final dataframe is built, although the total code is much more complex. Running the total code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128GB RAM.

library(dplyr)

numsessions <- 1000000
startdate   <- as.POSIXlt(runif(numsessions, 1, 365 * 60 * 60) * 24, origin = "2015-1-1")

df.Sessions <- data.frame(userID     = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr      = format(startdate, "%w"),
                          start      = startdate,
                          end        = startdate + runif(1, 1, 60 * 60 * 10))

dfhourly.connected <- df.Sessions %>% rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H")))

We want to parallelize this procedure over (some of) the 16 cores to speed it up. A first attempt was to use the multidplyr package. The partition is made based on daynr:

df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H"))) %>%
  collect()

Now, the rowwise function appears to require a dataframe as input instead of a partition.

My questions are:

  • Is there a workaround to perform a rowwise calculation on partitions per core?

  • Has anyone got a suggestion for performing this calculation with a different R package and method?

  • [the CRAN Task View about HPC](https://cran.r-project.org/web/views/HighPerformanceComputing.html) might give you some ideas – Jaap Jan 26 '16 at 19:58
  • Thank you, Jaap, I was kind of afraid of going into high performance computing. I might have to read some new literature/pages before getting this code to work. Have you got specific hints for packages to look further into? – JR Helmus Jan 26 '16 at 20:04
  • besides that page, you might want to look into the `data.table` package (for speed and memory efficiency; especially for large datasets this is imo a better choice than `dplyr`) or the `ff` package (which can work with datasets on disk instead of in RAM) – Jaap Jan 26 '16 at 20:09
  • for `data.table` there are some [*Getting Started Guides*](https://github.com/Rdatatable/data.table/wiki/Getting-started) on github – Jaap Jan 26 '16 at 20:11
  • I followed the DataCamp course on `data.table`; I found it not quite easy but doable. The question of the matter is whether `data.table` is able to work with functions that output a new dataframe. That was the main reason for me to work with the `rowwise` function. – JR Helmus Jan 26 '16 at 20:14
  • don't know, just post a question under the `data.table` tag; the maintainers as well as several high-profile users follow that tag closely; a well-asked question will certainly get a good response – Jaap Jan 26 '16 at 20:17
  • You are doing *A LOT* of highly inefficient steps in your code. Why are you calculating `seq` twice by row? Why are you running `as.Date` by row? Why are you creating a `data.frame` for each row? Why `format` by row? When writing efficient code you try to vectorize *as much as possible* and reduce `by` operations *as much as possible*. You are doing the exact opposite. – David Arenburg Jan 26 '16 at 20:36
  • Try the following simple improved code. I'd put my money on it performing much better than your multi-threaded one: ```library(data.table) ; res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = userID] ; res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1), V1 = NULL)]``` – David Arenburg Jan 26 '16 at 20:38
  • Thank you David, I will indeed try your code as soon as possible. I will open a new thread and try your code. You mention `by = userID` but users may be occurring up to 1,000 times in the dataset. Is this a problem in your code? – JR Helmus Jan 26 '16 at 20:43
  • Same user can have different start and end sessions? – David Arenburg Jan 26 '16 at 20:48
  • Yes, same user might enter the system between 1 and 1000 times in one year. – JR Helmus Jan 26 '16 at 20:49
  • Then the first line should be `res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]` – David Arenburg Jan 26 '16 at 20:50
  • tried your code and hol* f* it runs fast! I really need to dive deeper into how you came up with this simple code, but for now all obeisance to you. – JR Helmus Jan 26 '16 at 20:54

1 Answer


(I think posting this as an answer could benefit future readers who have an interest in efficient coding.)


R is a vectorized language, so operations by row are among the most costly operations, especially if you are evaluating lots of functions, dispatching methods, converting classes and creating new data sets while you're at it.
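As a tiny illustration of that cost (a minimal sketch; exact timings will vary by machine), compare the same addition done one element at a time versus vectorized:

x <- runif(1e5)
y <- runif(1e5)

system.time(z1 <- mapply(function(a, b) a + b, x, y))  # one function call per element
system.time(z2 <- x + y)                               # a single vectorized call
identical(z1, z2)                                      # TRUE - same result, very different cost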

Hence, the first step is to reduce the "by" operations. Looking at your code, it seems that you are enlarging the size of your data set according to userID, start and end - all the rest of the operations can come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt class will save you the overhead of method dispatch.

I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily:

library(data.table) 
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)] 

Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
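For illustration, here is a quick sketch of that dispatch overhead (exact numbers will depend on your machine):

library(microbenchmark)
p <- as.POSIXct("2015-01-01", tz = "UTC")
q <- p + 3600 * 5

microbenchmark(
  generic  = seq(p, q, by = 3600),        # generic seq() has to dispatch to seq.POSIXt
  explicit = seq.POSIXt(p, q, by = 3600)  # calling the method directly skips the dispatch
)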


Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.

Though, vectorizing isn't the end of the story. We also need to take into consideration class conversions, method dispatching, etc. For instance, we can create both the hourlydate and hournr using either different Date class functions, or using format, or maybe even substr. The trade-off that needs to be taken into account is that, for instance, substr will be the fastest, but the result will be a character vector rather than a Date one - it's up to you to decide if you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable:

library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")

microbenchmark("format" = format(test, "%H"),
               "substr" = substr(test, 12L, 13L),
               "data.table::hour" = hour(test))

# Unit: microseconds
#             expr        min         lq        mean    median        uq       max neval cld
#           format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88   100  b 
#           substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27   100   c
# data.table::hour      5.121      7.681     23.9746     27.84     33.44     55.36   100 a  

data.table::hour is the clear winner in both speed and quality (the result is an integer vector rather than a character one), improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).
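If you want to check this against the by-row version yourself, here is a rough sketch of such a comparison (assuming the df.Sessions from your question is in scope; it uses a small sample so the rowwise branch finishes in reasonable time, and the exact timings will of course vary):

library(dplyr)
library(data.table)
library(microbenchmark)

dt.small <- as.data.table(df.Sessions)[1:1000]  # small sample, otherwise rowwise/do takes too long
df.small <- as.data.frame(dt.small)

microbenchmark(
  rowwise_do = df.small %>% rowwise %>%
    do(data.frame(userID     = .$userID,
                  hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                  hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H"))),
  data.table = dt.small[, seq.POSIXt(start, end, by = 3600),
                        by = .(userID, start, end)],
  times = 5L
)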

Now let's try 3 different vectorized ways of calculating the hourlydate variable:

microbenchmark("as.Date" = as.Date(test), 
               "substr" = substr(test, 1L, 10L),
               "data.table::as.IDate" = as.IDate(test))

# Unit: milliseconds
#                 expr       min        lq      mean    median        uq       max neval cld
#              as.Date  19.56285  20.09563  23.77035  20.63049  21.16888  50.04565   100  a 
#               substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895   100   b
# data.table::as.IDate  19.91964  20.44250  27.50989  21.34551  31.79939 145.65133   100  a 

It seems like the first and third options are pretty much the same speed-wise, but I prefer as.IDate because of its integer storage mode.
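For reference, a quick way to see that storage difference:

typeof(as.Date("2015-06-01"))   # "double"  - Date stores the day count as numeric
typeof(as.IDate("2015-06-01"))  # "integer" - IDate stores it as an integer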


Now that we know where both efficiency and quality lie, we can simply finish the task by running

res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]

(You can then easily remove the unnecessary columns using similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
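For completeness, a minimal sketch of that cleanup, using the same idiom on the intermediate V1 column created by the expansion:

res[, V1 := NULL]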


There are probably more efficient ways of solving this, but this demonstrates a possible way of making your code more efficient.

As a side note, if you want to investigate data.table syntax/features further, here's a good read:

https://github.com/Rdatatable/data.table/wiki/Getting-started
