I am doing different data-processing operations over a big dataset (approx. 50M rows by 14 columns, some of them strings of no more than 14 characters) using arrow. I have used open_dataset to manipulate the data with arrow. After doing several operations, I call compute() on the final dataset, which I then use in some grouping and summarising operations within a for loop. Based on this dataset, the code is the following:
Note: summarise_numeric is a custom function that summarises several variables.
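For reproducibility, a minimal stand-in could look like the following (my real function summarises more variables; this sketch just sums every numeric column, assuming arrow can translate the across()/where() selection, otherwise the columns can be listed explicitly):

# Minimal stand-in for the real summarise_numeric(); it assumes summing
# every numeric column is enough (the actual function does more than this).
library(dplyr)
summarise_numeric <- function(.data) {
  .data %>%
    summarise(across(where(is.numeric), sum))
}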
geo_groupings <- list(national = "",
                      region   = "region",
                      zip_code = "Zip Code")

var_groupings <- list(unsegmented   = "",
                      no_gender     = c("Age"),
                      no_age        = c("Gender"),
                      all_variables = c("Gender", "Age"))
sales <- sales %>% compute()

# 3.2 Variables creation:
for (geo_group in names(geo_groupings)) {
  for (var_group in names(var_groupings)) {
    groups <- c("date", "customer")
    if (var_group != "unsegmented") groups <- c(groups, var_groupings[[var_group]])
    if (geo_group != "national") groups <- c(groups, geo_groupings[[geo_group]])
    # e.g. geo_group = "region", var_group = "all_variables" gives
    # groups = c("date", "customer", "Gender", "Age", "region")
    groups <- rlang::syms(groups)

    stacked <- sales %>%
      group_by(!!!groups) %>%
      summarise_numeric() %>%
      mutate(Segment = tolower(paste(!!!groups[-1], sep = "_"))) %>%
      relocate(Segment, .after = "date") %>%
      # We collect because there is no implementation of tidyr::pivot_wider
      # for the arrow package yet
      collect()
  }
}
Then I get the following error:
#> Error in `compute.arrow_dplyr_query()`:
#> ! Out of memory: realloc of size 8589934656 failed
The query itself is correct; the only issue is the memory reallocation. I need to call collect() because I then have to apply tidyr::pivot_wider() to stacked, and pivot_wider is not implemented in arrow as of version 11.0.0.3.
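For context, the reshaping step that forces the collect() looks roughly like this (the values_from selection is a placeholder; the actual value columns are whatever summarise_numeric produces):

# Hypothetical illustration of the post-collect() step; values_from is a
# placeholder for the summary columns created by summarise_numeric().
wide <- stacked %>%
  tidyr::pivot_wider(names_from = Segment,
                     values_from = where(is.numeric))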
An example of how I would approach this with arrow::map_batches in order to circumvent the memory issues is the following (where iris_dt is iris as an Arrow dataset, e.g. arrow::InMemoryDataset$create(iris)):
map_batches(iris_dt, function(batch) {
  batch %>%
    group_by(Species) %>%
    summarise(across(where(is.numeric), sum))
}) %>%
  collect()
For this particular example, it produces the same result as the eager version (presumably because iris is small enough to fit in a single batch):

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), sum)) %>%
  collect()
However, when calling the same code on my big dataset:
stacked <- arrow::map_batches(sales, function(batch) {
  batch %>%
    group_by(!!!groups) %>%
    summarise_numeric() %>%
    mutate(Segment = tolower(paste(!!!groups[-1], sep = "_"))) %>%
    relocate(Segment, .after = "date")
}) %>%
  collect()
The result that I get is not fully aggregated, as you would expect from a group_by and summarise operation; presumably map_batches applies the function to each batch independently, so groups that span several batches come back as multiple partially aggregated rows. I am wondering whether using the .lazy = FALSE argument would solve this issue.
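In case .lazy = FALSE does not change those semantics, the only workaround I can think of is a two-pass aggregation: partially aggregate inside each batch, then re-aggregate the much smaller partial results in memory. A rough sketch, assuming summarise_numeric() only computes sums (so partial sums can safely be summed again):

# First pass: partial aggregates per batch (a group may appear in
# several batches, producing duplicate partial rows).
partial <- arrow::map_batches(sales, function(batch) {
  batch %>%
    group_by(!!!groups) %>%
    summarise_numeric()
}) %>%
  collect()

# Second pass, in memory: merge the groups that were split across batches.
stacked <- partial %>%
  group_by(!!!groups) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop") %>%
  mutate(Segment = tolower(paste(!!!groups[-1], sep = "_"))) %>%
  relocate(Segment, .after = "date")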