I am trying to process a dataset that is larger than my available RAM. To this end, I am using arrow to manipulate the dataset as outlined in this question. However, the final processing step exceeds the memory my R session can handle, and the session crashes.
To avoid this, another user suggested using the arrow::map_batches function to process the arrow dataset batch by batch. However, the function's documentation contains no examples, and the vignettes that mention it do not really explain how it works either.
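For what it is worth, my current understanding is that map_batches() walks the dataset one RecordBatch at a time, applies the supplied function to each batch, and stitches the results back together. A minimal sketch of that understanding, counting rows per batch (ds stands in for any arrow Dataset):

library(arrow)
library(dplyr)

# The function receives each RecordBatch in turn, so this returns
# one row per batch rather than one row for the whole dataset
ds %>%
  map_batches(function(batch) {
    batch %>%
      summarise(n_rows = n())
  }) %>%
  collect()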
My intention is to process the whole dataset, but batch by batch instead of all at once, so that the session does not run out of memory and R does not crash. This is the code I have used:
library(arrow)
library(dplyr)

# sales is an arrow Dataset
# groups is a list of symbols holding the variables I group by
# summarise_numeric() is a helper of mine that summarises every
# numeric column
stacked <- arrow::map_batches(sales, function(batch) {
  batch %>%
    group_by(!!!groups) %>%
    summarise_numeric() %>%
    mutate(customer_type = tolower(paste(!!!groups[2:length(groups)], sep = "_"))) %>%
    relocate(customer_type, .after = "date")
}) %>%
  # We collect() because tidyr::pivot_wider has no arrow
  # implementation yet
  collect()
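In case it matters, groups is built roughly like this (the column names below are placeholders for my real ones; "date" comes first, which is why I can use relocate(..., .after = "date") above):

library(rlang)

# Placeholder grouping columns; the real dataset uses different names
group_cols <- c("date", "channel", "segment")
groups <- syms(group_cols)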
library(tidyr)

variables <- stacked %>%
  select(c("date", "customer_type") | where(is.numeric)) %>%
  pivot_wider(id_cols = "date", names_from = "customer_type", values_from = where(is.numeric)) %>%
  rename_with(~ paste0("kpi_", .), .cols = where(is.numeric))
The process up to stacked now finishes, but the table that comes back has more than one observation per group: it looks as if each batch is grouped and summarised on its own, so a group that spans several batches appears once per batch in the collected result. I would like the data to be completely grouped before I build the variables data frame.
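I realise I could squash the duplicates with a second aggregation after collecting, along these lines, but that is another in-memory pass over the very data I am trying not to hold at once, and it assumes every KPI can simply be summed again (which may not hold for all of my metrics):

# Workaround I would rather avoid: a second full aggregation in memory
stacked %>%
  group_by(date, customer_type) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")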
Any help on how to use the arrow::map_batches function correctly here is much appreciated.