
I have a tibble that includes a list-column with vectors inside. I want to create a new column that holds the length of each vector. Since this dataset is large (3M rows), I thought I could shave off some processing time using the furrr package. However, it seems that purrr is faster than furrr. How come?

To demonstrate the problem, I first simulate some data. Don't bother to understand the code in the simulation part, as it's irrelevant to the question.


data simulation function

library(stringi)
library(rrapply)
library(tibble)

simulate_data <- function(nrows) {
  # split a vector into n chunks of (roughly) equal size
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  
  # keep a random subset (of random size) of a vector
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  
  tibble::tibble(
    # list-column: one randomly subsetted named integer vector per row
    col_a = rrapply(object = split_func(
      x = setNames(1:(nrows * 5),
                   stringi::stri_rand_strings(nrows * 5,
                                              2)),
      n = nrows
    ),
    f      = randomly_subset_vec),
    col_b = runif(nrows)
  )
  
}

simulate data

set.seed(2021)

my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine

my_data
## # A tibble: 3,000,000 x 2
##    col_a      col_b
##    <list>     <dbl>
##  1 <int [3]> 0.786 
##  2 <int [5]> 0.0199
##  3 <int [2]> 0.468 
##  4 <int [2]> 0.270 
##  5 <int [3]> 0.709 
##  6 <int [2]> 0.643 
##  7 <int [2]> 0.0837
##  8 <int [4]> 0.159 
##  9 <int [2]> 0.429 
## 10 <int [2]> 0.919 
## # ... with 2,999,990 more rows

the actual problem
I want to mutate a new column (length_col_a) that holds the length of each vector in col_a. I'm going to do this twice: first with purrr::map_int() and then with furrr::future_map_int().

library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)

# first with purrr:
##################
tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~length(.x)))

## # A tibble: 3,000,000 x 3
##    col_a      col_b length_col_a
##    <list>     <dbl>        <int>
##  1 <int [3]> 0.786             3
##  2 <int [5]> 0.0199            5
##  3 <int [2]> 0.468             2
##  4 <int [2]> 0.270             2
##  5 <int [3]> 0.709             3
##  6 <int [2]> 0.643             2
##  7 <int [2]> 0.0837            2
##  8 <int [4]> 0.159             4
##  9 <int [2]> 0.429             2
## 10 <int [2]> 0.919             2
## # ... with 2,999,990 more rows
toc()
## 6.16 sec elapsed


# and now with furrr:
####################
future::plan(future::multisession, workers = 2)

tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, length))
## # A tibble: 3,000,000 x 3
##    col_a      col_b length_col_a
##    <list>     <dbl>        <int>
##  1 <int [3]> 0.786             3
##  2 <int [5]> 0.0199            5
##  3 <int [2]> 0.468             2
##  4 <int [2]> 0.270             2
##  5 <int [3]> 0.709             3
##  6 <int [2]> 0.643             2
##  7 <int [2]> 0.0837            2
##  8 <int [4]> 0.159             4
##  9 <int [2]> 0.429             2
## 10 <int [2]> 0.919             2
## # ... with 2,999,990 more rows
toc()
## 10.95 sec elapsed

I know tictoc isn't the most accurate way to benchmark, but even so, furrr is supposed to be faster here (as the vignette suggests), and it isn't. I've made sure that the data isn't grouped, since the package author has explained that furrr doesn't work well with grouped data. What other explanation could there be for furrr being slower than (or at best barely faster than) purrr?
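(For completeness, here is how I checked that; this snippet isn't part of the benchmark above, and it only uses standard dplyr helpers.)

# confirm the tibble carries no grouping metadata before mapping
dplyr::is_grouped_df(my_data)  # expect FALSE
dplyr::group_vars(my_data)     # expect character(0)
# my_data <- dplyr::ungroup(my_data)  # would drop any grouping if there were one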


EDIT


I found this issue on furrr's GitHub repo that discusses almost the same problem. However, my case is different. In the GitHub issue, the function being mapped is a user-defined function that requires attaching additional packages, and the author explains that each furrr worker has to attach those packages before doing the calculation. By contrast, I map the length() function from base R, so there should be practically no overhead from attaching packages.

In addition, the author suggests that problems may arise because plan(multisession) wasn't working properly from within RStudio, and that updating the parallelly package to its development version solves this:

remotes::install_github("HenrikBengtsson/parallelly", ref="develop")

Unfortunately, this update didn't make any difference in my case.
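For what it's worth, here is how I verified that the multisession plan is actually active after the update (again, not part of the benchmark above); plan() with no arguments, nbrOfWorkers() and availableCores() come from the future and parallelly packages:

future::plan()                # prints the currently active plan (should be multisession)
future::nbrOfWorkers()        # should be 2, matching workers = 2 above
parallelly::availableCores()  # how many cores the machine exposes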

Emman
  • The culprit may be the very large size of your dataset. See: https://furrr.futureverse.org/#data-transfer – PaulS Nov 02 '21 at 19:55
  • @PaulSmith, yes I had seen this, but didn't think it applied in my case because the data is supposed to be split as part of the parallelization. [Here](https://furrr.futureverse.org/articles/articles/gotchas.html) the vignette says: `furrr [...] doing what it is good at - sharding the x column into equally sized groups and sending them off to the workers to process them in parallel.` If the size of my data is the problem, then I can hardly see in what situations I could benefit from `furrr`. – Emman Nov 02 '21 at 20:13
  • The benefit of parallelization comes from the possibility of separating _processing-intensive_ activities over several processors. If you add a time delay before calculating the length, then you will see that `furrr` has a large advantage over `purrr`. This suggests that the overhead may be coming from the management of the very large dataset (sending pieces to the workers). I have just tried with a delay of `0.000001` and the results are: `purrr --> 192.45 sec` and `furrr: 44.707 sec` (`8 workers`). – PaulS Nov 02 '21 at 20:50
  • @PaulSmith, thanks. That's interesting. Would you consider posting this as an answer so I could play with your code? Also for better visibility of your response. – Emman Nov 02 '21 at 21:39
  • Done so, @Emman ! – PaulS Nov 02 '21 at 23:00

1 Answer


As I have argued in the comments to the original post, my suspicion is that there is an overhead caused by distributing the very large dataset to the workers.
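A quick way to gauge how much data has to travel to the workers (my addition here, not something the OP measured) is to look at the size of the list-column itself with base R's object.size():

# rough size of what must be serialized and shipped to the worker processes
format(object.size(my_data$col_a), units = "MB")
format(object.size(my_data), units = "MB")

For a list of 3 million elements this is likely to be substantial, and under plan(multisession) every chunk of it has to be serialized and sent to a separate R process, which is pure overhead when the mapped function is as cheap as length().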

To substantiate my suspicion, I have used the same code used by the OP with a single modification: I added a delay of 0.000001 seconds inside the mapped function. The results were: purrr: 192.45 sec and furrr: 44.707 sec (8 workers). The time taken by furrr was only about 1/4 of that taken by purrr, still very far from the 1/8 one might expect with 8 workers!

My code is below, as requested by the OP:

# simulate_data() is exactly as defined in the OP's question above,
# so the simulation block is not repeated here.

set.seed(2021)

my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine

library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)

# first with purrr:
##################

######## ---->  DELAY <---- ########
# same as length(), but with a tiny artificial delay per element
f <- function(x) {Sys.sleep(0.000001); length(x)}

tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
toc()

plan(multisession, workers = 8)

tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, f))
toc()
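As a final aside that goes beyond the purrr-vs-furrr comparison: for this specific operation no element-wise mapping is needed at all, because base R's lengths() returns the lengths of all list elements in a single vectorised call, with no parallel overhead:

# vectorised alternative for this particular task (no mapping, no workers)
tic()
my_data %>%
  mutate(length_col_a = lengths(col_a))
toc()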
PaulS