
This question is similar to other very-large-data problems in R, but I can't find an example of how to merge/join two dfs and then perform calculations on the result (as opposed to reading in lots of dataframes and using mclapply to do the calculations). The problem here is not loading the data (it takes ~20 min, but the dfs do load), but the merging and summarising.

I've tried every data.table approach I could find, different types of joins, and ff, and I still run into data.table's vecseq limit of 2^31 rows. Now I'm trying to use multidplyr to do it in parallel, but I can't figure out where the errors are coming from.

Dataframes:

species_data # df with ~65 million rows, cols <- c("id", "species_id")
lookup # df with ~17 million rows, cols <- c("id", "cell_id", "rgn_id")

Not all ids in the lookup appear in species_data.
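For reference, this is roughly the data.table version I tried; a sketch rather than the exact code, but the keyed join is where the vecseq error appears on the full tables:

library(data.table)

## sketch of one data.table attempt on the full (~65M / ~17M row) tables
setDT(species_data)
setDT(lookup)
setkey(species_data, id)
setkey(lookup, id)
## this keyed left join is where I hit the "vecseq ... > 2^31 rows" error
joined <- lookup[species_data]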

## make sample dataframes:

## id stored as character so it can join against species_data$id below
lookup <- data.frame(id = as.character(seq(2001, 2500, by = 1)),
                     cell_id = seq(1, 500, by = 1),
                     rgn_id = seq(801, 1300, by = 1),
                     stringsAsFactors = FALSE)

library(stringi)

## species ids like "ABC-12345"
species_id <- sprintf("%s%s%s", stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
                      "-",
                      stri_rand_strings(n = 1000, length = 5, pattern = "[1-9]"))

## three-character ids of the form 2xy, with x in 0-4 and y in 0-9
id <- sprintf("%s%s%s", stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
              stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
              stri_rand_strings(n = 1000, length = 1, pattern = "[0-9]"))

species_data <- data.frame(species_id, id, stringsAsFactors = FALSE)

## merge and join dfs with multidplyr:

library(tidyverse)
## multidplyr isn't on CRAN, so install it from GitHub:
# install.packages("devtools")
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(parallel)

species_summary <- species_data %>%
  # partition the species data by species id
  partition(species_id, cluster = cluster) %>%
  left_join(lookup, by = "id") %>%
  dplyr::select(-id) %>%
  group_by(species_id) %>%
  ## total number of cells each species occurs in
  mutate(tot_count_cells = n_distinct(cell_id)) %>%
  ungroup() %>%
  dplyr::select(c(cell_id, species_id, rgn_id, tot_count_cells)) %>%
  group_by(rgn_id, species_id) %>% 
  ## number of cells each species occurs in each region
  summarise(count_cells_eez = n_distinct(cell_id)) %>% 
  collect() %>%
  as_tibble()
## Error in partition(., species_id, cluster = cluster) : unused argument (species_id)

## If I change to:
species_summary <- species_data %>%
  group_by(species_id) %>%
  partition(cluster = cluster) %>% ...
## get, "Error in worker_id(data, cluster) : object 'cluster' not found

This is my first attempt at parallel processing and big data, and I'm struggling to diagnose these errors.

Thanks!

leslie roberson
  • Can you provide a small part or a description of your data? It would be useful to understand your problem. – agila Mar 17 '20 at 10:47
  • @agila edited the question with sample dataframes – leslie roberson Mar 17 '20 at 23:29
  • Hi. I just tried running your example with the sample data frames, but the problem is that after running `filter(!is.na(rgn_id))` I get no observation because none of the `rgn_id` is not NA. Could you update the sample dataframe or maybe share a portion of the original ones? – agila Mar 18 '20 at 15:39
  • @agila here's a better representation of the lookup data: lookup <- data.frame(id = seq(2001,2500, by = 1), cell_id = seq(1,500, by = 1), rgn_id = seq(801,1300, by = 1)) the rgn_ids are just 3-digit numbers. I just edited the code to remove that line, because I can actually do it before the join to reduce the size of the dataframe – leslie roberson Mar 20 '20 at 00:38
  • @leslieroberson great question! – Christian Mar 20 '20 at 09:37
  • Another question: in the code you report `group_by(rgn_id_2)` but what is `rgn_id_2` ? – agila Mar 20 '20 at 16:30
  • @agila good catch - it's a mistake, should be rgn_id - corrected it in the question. I've been messing with this data trying so many different ways and I had reformatted the rgn_id column at one point – leslie roberson Mar 21 '20 at 11:38
  • I also tried using the disk.frame() package, and splitting it up into multiple steps, but still can't get the join() to work – leslie roberson Mar 21 '20 at 11:40

1 Answer


First, I load dplyr and multidplyr:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(multidplyr)
my_clusters <- new_cluster(3) # I have 4 cores
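If you don't know how many cores your machine has, here is a sketch of one way to size the cluster (leaving one core free for the OS is a habit, not a requirement; n_workers is just a local variable name):

# size the cluster from the machine's core count, keeping one core free
n_workers <- max(1, parallel::detectCores() - 1)
my_clusters <- new_cluster(n_workers)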

then I load a slightly adjusted version of the data that you propose, making sure id is character on both sides so the join keys match

library(stringi)
lookup <- tibble(
  id = as.character(seq(2001, 2500, by = 1)),
  cell_id = seq(1, 500, by = 1),
  rgn_id = sprintf("%s", stri_rand_strings(n = 500, length = 3, pattern = "[0-9]"))
)

species_id <- sprintf(
  "%s%s%s", 
  stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
  "-",
  stri_rand_strings(n = 1000, length = 5, pattern = "[1-9]")
)
id <- sprintf(
  "%s%s%s", 
  stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
  stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
  stri_rand_strings(n = 1000, length = 1, pattern = "[0-9]")
)

species_data <- tibble(species_id, id)

Check the result

species_data
#> # A tibble: 1,000 x 2
#>    species_id id   
#>    <chr>      <chr>
#>  1 CUZ-98293  246  
#>  2 XDG-61673  234  
#>  3 WFZ-94338  230  
#>  4 UIH-97549  226  
#>  5 AGE-35257  229  
#>  6 BMD-75361  249  
#>  7 MJB-78799  226  
#>  8 STS-15141  225  
#>  9 RXD-18645  245  
#> 10 SKZ-58666  243  
#> # ... with 990 more rows
lookup
#> # A tibble: 500 x 3
#>    id    cell_id rgn_id
#>    <chr>   <dbl> <chr> 
#>  1 2001        1 649   
#>  2 2002        2 451   
#>  3 2003        3 532   
#>  4 2004        4 339   
#>  5 2005        5 062   
#>  6 2006        6 329   
#>  7 2007        7 953   
#>  8 2008        8 075   
#>  9 2009        9 008   
#> 10 2010       10 465   
#> # ... with 490 more rows

Now I can run the code using a multidplyr approach. I split the dplyr pipeline into two steps, one for each group_by():

first_step <- species_data %>% 
  left_join(lookup, by = "id") %>% 
  select(-id) %>% 
  group_by(species_id) %>% 
  partition(my_clusters) %>% 
  mutate(tot_count_cells = n_distinct(cell_id)) %>% 
  collect() %>% 
  ungroup()
first_step
#> # A tibble: 1,000 x 4
#>    species_id cell_id rgn_id tot_count_cells
#>    <chr>        <dbl> <chr>            <int>
#>  1 UIH-97549       NA <NA>                 1
#>  2 BMD-75361       NA <NA>                 1
#>  3 STS-15141       NA <NA>                 1
#>  4 RXD-18645       NA <NA>                 1
#>  5 HFI-78676       NA <NA>                 1
#>  6 KVP-45194       NA <NA>                 1
#>  7 SGW-29988       NA <NA>                 1
#>  8 WBI-79521       NA <NA>                 1
#>  9 MFY-86277       NA <NA>                 1
#> 10 BHO-37621       NA <NA>                 1
#> # ... with 990 more rows

and

second_step <- first_step %>% 
    group_by(rgn_id, species_id) %>% 
    partition(my_clusters) %>% 
    summarise(count_cells_eez = n_distinct(cell_id)) %>% 
    collect() %>% 
    ungroup()
second_step
#> # A tibble: 1,000 x 3
#>    rgn_id species_id count_cells_eez
#>    <chr>  <chr>                <int>
#>  1 <NA>   ABB-24645                1
#>  2 <NA>   ABY-98559                1
#>  3 <NA>   AEQ-42462                1
#>  4 <NA>   AFO-58569                1
#>  5 <NA>   AKQ-44439                1
#>  6 <NA>   AMF-23978                1
#>  7 <NA>   ANF-49159                1
#>  8 <NA>   APD-85367                1
#>  9 <NA>   AQH-64126                1
#> 10 <NA>   AST-77513                1
#> # ... with 990 more rows

Created on 2020-03-21 by the reprex package (v0.3.0)
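As a quick sanity check, a sketch: the same summary computed serially with plain dplyr should match second_step up to row order:

# serial version of the final summary, for comparison with the parallel result
serial <- species_data %>% 
  left_join(lookup, by = "id") %>% 
  group_by(rgn_id, species_id) %>% 
  summarise(count_cells_eez = n_distinct(cell_id)) %>% 
  ungroup()

# should be TRUE once both results are sorted the same way
all.equal(
  arrange(serial, rgn_id, species_id),
  arrange(second_step, rgn_id, species_id)
)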

agila
  • I think this works great for sample dataframes, but at first_step I'm still getting the error, "Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches, environment()) : negative length vectors are not allowed," which I googled and means the merge results in > 2^31 rows – leslie roberson Mar 24 '20 at 00:45
  • If you can't share the original dataframes then I think I can't do anything more than that example, sorry. – agila Mar 24 '20 at 08:31
  • btw: did you check [dbplyr](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html)? – agila Mar 24 '20 at 08:33
  • I'm happy to share them, here's a dropbox link if you want to have a go: https://www.dropbox.com/s/fh0m1aofn0n04xh/xbd_troubleshooting.7z?dl=0. I was thinking they'd be too big. – leslie roberson Mar 24 '20 at 11:30
  • I didn't know about dbplyr, I'll have a look. I see you have to set up a connection to the database - I'm on a university database with restricted access, so not as easy as SQLite for example, but it looks promising and I'll explore. Thanks for the tip – leslie roberson Mar 24 '20 at 11:34
  • Thanks for sharing the data! I will check your example again during the weekend. – agila Mar 25 '20 at 08:56
  • I'm sorry but I couldn't solve the problem. Maybe you can ask a new question [here](https://github.com/Rdatatable/data.table/issues) reporting the data.table error and asking for help. – agila Apr 02 '20 at 12:58
  • Thanks for trying @agila, I think it's just a limitation with massive data. I voted for your answer because the code works well for pretty large merges. I found a way to manually break the whole analysis into smaller steps, was tedious but I was eventually able to get the results using multidplyr and parallel as you suggested – leslie roberson Apr 06 '20 at 03:20