4

I've been working on an exercise where I'm needing to calculate max / mins with a variable window length on some large datasets (~100 - 250 million rows).

In short, I have a table showing the start and end index (denoted by "Lookup_table" below) which reflect the row positions of a second table (referred to as "Price_table" below). Using those row positions, I then need to extract the maximum and minimum of a particular column in "Price_table". I need to repeat this for all rows of the "Lookup" table.

For example, if the first row of "Lookup_table" had Start = 1 and End = 5, I would need to find the maximum / minimum of the target column in Price_table from rows 1 to 5 inclusive. Then, if the second column had Start = 6, End = 12, I'd then find the max / min within Price_table from rows 6 to 12, and so on and so forth.

I've created a dummy set of data below with 10,000 rows (apologies for all the package requirements).

require(data.table)
require(dplyr)
require(purrr)
   
# Number of rows
nn <- 10000
# Create a random table of Price data with 1,000,000 rows
Price_table <- data.table(Price = runif(nn,300,1000)) %>% mutate(.,Index = seq_len(nrow(.)))

# Create a lookup table with start / end indexes
Lookup_table <- data.table(Start = floor(sort(runif(nn,1,nn)))) %>% mutate(.,End = shift(Start,fill = nn,type = "lead"))

I initially calculated the max / mins using the following lines of code. Unfortunately, I found that it failed on the really large datasets as it ran out of memory (I have 64 Gig of ram).

# Option 1: Determine the max / min between the Start / End prices in the "Price_table" table's "Price" column
Lookup_table[,(c("High_2","Low_2")) := rbindlist(pmap(.l = list(x = Start,y = End),
                       function(x,y) data.table(Max = max(Price_table$Price[x:y],na.rm = T),
                                                Min = min(Price_table$Price[x:y],na.rm = T))))]

I haven't re-tested the following alternate option on the full data-set yet, but based on some smaller sets of data, it seemed to be more memory efficient (well, at least using memory.size(), which may or may not provide an accurate reflection...).

# Option 2: Use mapply separately to determine the max / min between the Start / End prices in the "Price_table" table's "Price" column
Lookup_table[,High := mapply(function(x,y) max(Price_table$Price[x:y],na.rm = T),Start,End)]
Lookup_table[,Low := mapply(function(x,y) min(Price_table$Price[x:y],na.rm = T),Start,End)]

I have two questions:

  1. If I'm correct in saying that the mapply approach is more memory efficient (not in general, but at least relative to my first attempt), can someone please explain why that is the case? Is it, for example, because of the use of the rbindlist() + data.table() calls in the first attempt?
  2. Are there any other, memory-efficient (and faster?), approaches that I should consider when dealing with larger data sets?

Thanks in advance. Phil

Phil
  • 261
  • 1
  • 8
  • 1
    Side note: `library`-vs-`require`, https://stackoverflow.com/a/51263513/3358272. If you use `require`, check its return value; if you do not check it, the script will happily continue even if the package is not available, rendering the script fragile for distribution. – r2evans Aug 30 '21 at 11:34
  • 1
    Thanks for that - I didn't know that! – Phil Aug 31 '21 at 03:22
  • I think you want frollmin(adaptive=TRUE) but min (and max) are not yet implemented. – jangorecki Sep 05 '21 at 20:36

1 Answers1

5

You could use non-equijoins :

Price_table[Lookup_table,.(Price,Start,End),on=.(Index>=Start,Index<=End)][
            ,.(Low = min(Price), High = max(Price)),by=.(Start,End)]

      Start   End      Low     High
   1:     3     5 668.3308 908.1017
   2:     5     5 908.1017 908.1017
   3:     5     6 333.3477 908.1017
   4:     6     7 333.3477 827.1258
   5:     7     8 785.8887 827.1258
  ---                              
8947:  9993  9995 395.8449 827.7860
8948:  9995  9995 827.7860 827.7860
8949:  9995  9997 418.7436 827.7860
8950:  9997  9999 418.7436 947.1398
8951:  9999 10000 489.3145 634.6268
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • 1
    i think it can be within `[]`: `Price_table[Lookup_table, on=.(Index>=Start, Index<=End), by=.EACHI, as.list(range(x.Price))]`. unless if there are dupes in the `Lookup_table` – chinsoon12 Aug 30 '21 at 09:39
  • @chinsoon12, thanks for the suggestion. There might be dupes in the lookup table as results aren't exactly the same on above example. – Waldi Aug 30 '21 at 11:54
  • Thank you very much for the quick response. I'll look into non-equijoins in more detail. Really appreciate it! – Phil Aug 31 '21 at 03:21