I've been working on an exercise where I need to calculate maxes/mins with a variable window length on some large datasets (~100 - 250 million rows).
In short, I have a table (denoted "Lookup_table" below) holding start and end indices that correspond to row positions in a second table (referred to as "Price_table" below). Using those row positions, I need to extract the maximum and minimum of a particular column in "Price_table", and repeat this for every row of "Lookup_table".
For example, if the first row of "Lookup_table" had Start = 1 and End = 5, I would need to find the maximum/minimum of the target column in "Price_table" over rows 1 to 5 inclusive. If the second row had Start = 6 and End = 12, I'd then find the max/min within "Price_table" over rows 6 to 12, and so on and so forth.
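In base R terms, the calculation for that first row would just be:

# Max / min of the Price column over rows 1 to 5 inclusive
max(Price_table$Price[1:5], na.rm = TRUE)
min(Price_table$Price[1:5], na.rm = TRUE)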
I've created a dummy set of data below with 10,000 rows (apologies for all the package requirements).
require(data.table)
require(dplyr)
require(purrr)
# Number of rows
nn <- 10000
# Set a seed so the dummy data is reproducible
set.seed(1)
# Create a random table of Price data with nn (10,000) rows
Price_table <- data.table(Price = runif(nn, 300, 1000)) %>% mutate(., Index = seq_len(nrow(.)))
# Create a lookup table with start / end indexes
Lookup_table <- data.table(Start = floor(sort(runif(nn, 1, nn)))) %>% mutate(., End = shift(Start, fill = nn, type = "lead"))
I initially calculated the max/min using the following lines of code. Unfortunately, it failed on the really large datasets as it ran out of memory (I have 64 GB of RAM).
# Option 1: Determine the max / min of the "Price" column in "Price_table" between each Start / End pair
Lookup_table[, c("High_2", "Low_2") := rbindlist(pmap(.l = list(x = Start, y = End),
                                                      function(x, y) data.table(Max = max(Price_table$Price[x:y], na.rm = TRUE),
                                                                                Min = min(Price_table$Price[x:y], na.rm = TRUE))))]
I haven't re-tested the following alternative on the full dataset yet, but based on some smaller datasets it seemed to be more memory efficient (at least according to memory.size(), which may or may not be an accurate reflection...).
# Option 2: Use mapply separately to determine the max / min of the "Price" column in "Price_table" between each Start / End pair
Lookup_table[, High := mapply(function(x, y) max(Price_table$Price[x:y], na.rm = TRUE), Start, End)]
Lookup_table[, Low := mapply(function(x, y) min(Price_table$Price[x:y], na.rm = TRUE), Start, End)]
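As an aside on measurement: memory.size() is Windows-only (and deprecated in recent R versions), so its figures should probably be taken with a grain of salt. A more portable way to compare the two options might be the bench package, which reports allocations per expression; a minimal sketch, assuming bench is installed:

library(bench)
# mem_alloc in the result is the total memory allocated by R while running the expression
bench::mark(
  Lookup_table[, High := mapply(function(x, y) max(Price_table$Price[x:y], na.rm = TRUE),
                                Start, End)],
  iterations = 1
)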
I have two questions:
- If I'm correct in saying that the mapply approach is more memory efficient (not in general, but at least relative to my first attempt), can someone please explain why that is the case? Is it, for example, because of the rbindlist() call plus the per-row data.table() calls in the first attempt?
- Are there any other memory-efficient (and ideally faster) approaches that I should consider when dealing with larger datasets? One idea I've been wondering about is sketched below.
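For reference, the idea mentioned above is a data.table non-equi join with by = .EACHI, which (I believe) avoids building a one-row data.table per window. This is just an untested sketch, not something I've benchmarked at full scale:

# For each row of Lookup_table, join the Price_table rows with Start <= Index <= End
# and aggregate within each matched group (by = .EACHI keeps Lookup_table's row order)
High_Low <- Price_table[Lookup_table,
                        on = .(Index >= Start, Index <= End),
                        .(High = max(Price, na.rm = TRUE),
                          Low = min(Price, na.rm = TRUE)),
                        by = .EACHI]
Lookup_table[, c("High_3", "Low_3") := High_Low[, .(High, Low)]]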
Thanks in advance. Phil