Efficient Match/Lookup in R

Question

Starting with 2 objects: 1 data frame of order attributes - Order Numbers, Weights and Volumes, and 1 list - combination strings of Order Numbers.

attr <- data.frame(Order.No = c(111,222,333), Weight = c(20,75,50), Volume = c(10,30,25))
combn <- list(111, 222, 333, c(111,222), c(111,333), c(222,333), c(111,222,333))

The objective is to find the total weight and cube for each string of orders, and keep only the combinations that are within both the weight and cube constraints.

I'm currently using the following -

# Lookup weights for each Order.No in the attr table
# Add up total weight for the combination and keep it if it's in the range
wgts <- lapply(combn, function(x) {
    temp <- attr$Weight[match(x, attr$Order.No)]
    temp <- sum(temp)
    temp[temp <= 50 & temp >= 20]
})
> wgts
[[1]]
[1] 20

[[2]]
numeric(0)

[[3]]
[1] 50

[[4]]
numeric(0)

[[5]]
numeric(0)

[[6]]
numeric(0)

[[7]]
numeric(0)

# Lookup volumes for each Order.No in the attr table
# Add up total volume for the combination and keep it if it's in the range
vols <- lapply(combn, function(x) {
    temp <- attr$Volume[match(x, attr$Order.No)]
    temp <- sum(temp)
    temp[temp <= 50 & temp >= 10]
})
> vols
[[1]]
[1] 10

[[2]]
[1] 30

[[3]]
[1] 25

[[4]]
[1] 40

[[5]]
[1] 35

[[6]]
numeric(0)

[[7]]
numeric(0)

Then use mapply to merge the two lists of weights and volumes.

# Find and keep only the rows that have both the weights and volumes within their ranges  
which(lapply(mapply(c, wgts, vols), function(x) length(x)) == 2)

# Yields position 1 and 3 which meet the subsetting conditions    
> value value 
    1     3

The code above looks up the individual order weights and cubes, sums them all together, checks to make sure they are within each range limit, merges both lists together and keeps only those that have both the weight and cubes within the acceptable ranges.

My current solution, which successfully completes the task, is terribly slow on production volume and does not scale well with millions of records. With 11 MM order combinations to lookup, this process takes ~40 minutes to run, which is unacceptable.

I'm seeking a more efficient method that will drastically reduce the run-time required to produce the same output.

What is `c` in your example? (last line of the example code) — talat, Oct 17 '15 at 17:45
Perhaps `library(data.table);library(fastmatch);setDT(melt(combn))[,{tmp <- sum(attr$Weight[fmatch(value, attr$Order.No)]); tmp[tmp <=50 & tmp >=20]}, L1]` — akrun, Oct 17 '15 at 17:48
Maybe you don't need to look up all 11MM order combinations...? There's clearly some combinatorial problem you've decided to solve by enumerating a large number of combinations. — Frank, Oct 17 '15 at 18:14
Sorry, I'm not sure where all of the confusion is coming from. "...data to play with" is exactly what I've already given above. I've listed reproducible code that works. My "desired output" is exactly what this code yields, but in a more computationally efficient manner is what I'm looking for. The "c" above is my understanding of how to use the mapply to join two lists together. I'm all ears if there's a better way. — Brian, Oct 17 '15 at 19:11
I'm essentially building a brute force optimizer that considers all possible combinations of variable, and then I need to select the most desirable combination. I've tried optimization packages, but am unable to find a solution using this method due to upstream constraints, so I ended up building my own. — Brian, Oct 17 '15 at 19:14
@akrun Thank you for the effort, but unfortunately it yields approximately the same run time as my code. — Brian, Oct 17 '15 at 19:58
@Brian Your code surely does not work, but I get the idea. Maybe you should look at that `as.data.frame(c)` part that Shawn asked about, since you have not provided any object named `c` (which, like `combn` and `attr` happens to be the name of a function and so probably should be avoided). — Frank, Oct 17 '15 at 20:59
@Frank Agreed. I need to revisit my naming conventions. I was using the 'c' as a concatenate function for the two lists. Thank you all for the suggestions. — Brian, Oct 17 '15 at 21:20
In your real data, is `combn` all possible combinations (taken 1, 2, ..., n at a time) of `Order.No` as well? — Arun, Oct 17 '15 at 21:53
@Arun Not necessarily 'all', but rather all possible combinations of 15. In my real data `combn` is a large list of all possible combinations of 15 Order.Nos. e.g. If I had 3 orders, `n=3` as in my example, it would be `(2^3)-1 = 7` (111, 222, 333, 111&222, 111&333, 222&333, 111&222&333). — Brian, Oct 17 '15 at 22:02
@Frank Sorry Frank, but I just noticed you said my code "does not work". I don't understand why you believe it doesn't. It works just fine for me, although inefficient. See my edits above. It prints out the only two combinations that satisfy the constraints. Do you just mean that my syntax is funny because of the `c` in mapply? Regardless, it should still print out correctly? — Brian, Oct 17 '15 at 22:18
Try starting a new R session and running the code above. Or just start a new R session and run `as.data.frame(c)` -- it will not work. I'm not sure why you thought it would :) Functions cannot be coerced to a data.frame. — Frank, Oct 17 '15 at 22:21
@Frank Ha. Yeah, the data.frame must have been a holdout from when I was mangling the code into shape. Edited above. `mapply(c, wgts, vols)` produces the correct output though. Since it's gotten kind of messy at this point, would you suggest I rethink my question and repost under a new thread, or should I continue on here? Thanks. — Brian, Oct 17 '15 at 22:29
Yeah, that might be a good idea. You might want to have a look at well-received [tag:r]+[tag:performance] questions to get an idea about what a good question looks like. Here are a couple: http://stackoverflow.com/q/33027611 and http://stackoverflow.com/q/31852294 You could also consider studying up on the data.table package before asking again, since it's very likely to play a role in the best answer to your question. Either way, I'd consider giving yourself another day or two to work through the problem and formulate the question before posting again, yeah. — Frank, Oct 17 '15 at 22:42

Frank · Answer 1 · 2015-10-17T18:01:02.887

4

# changing names, assigning indices to order list
atdf  = data.frame(Order.No = c(111,222,333), Weight = c(20,75,50), Volume = c(10,30,25))
olist = list(111, 222, 333, c(111,222), c(111,333), c(222,333), c(111,222,333))
olist <- setNames(olist,seq_along(olist))

# defining filtering predicate:

sel_orders = function(os, mins=c(20,10), maxs=c(50,50)) {
    tot = colSums(atdf[match(os, atdf$Order.No), c("Weight","Volume")])
    all(maxs >= tot & tot >= mins)
}

# Filtering orders

olist[sapply(olist, sel_orders)]
# or 
Filter(x = olist, f = sel_orders)

both of which give

# $`1`
# [1] 111
# 
# $`3`
# [1] 333

To change the maxes and mins...

olist[ sapply(olist, sel_orders, mins = c(0,0), maxs = c(70,70)) ]

# $`1`
# [1] 111
# 
# $`3`
# [1] 333
# 
# $`5`
# [1] 111 333

edited Oct 17 '15 at 18:01

answered Oct 17 '15 at 17:53

Frank

66,179
8
96
180

1

Again thanks, but this too yields essentially the same run time as my code. I picked up a few things from your approach however so I do appreciate that. – Brian Oct 17 '15 at 21:21
Yeah, I'm not surprised this isn't faster. For a real speedup, I suspect you'll need someone to post something along the lines of @akrun's comment/answer on data.table. For that to happen, you may need to post a somewhat bigger example, such that benchmarking can be done. Also (a minor thing), you might want to tag the question with [tag:performance]. – Frank Oct 17 '15 at 21:26

score 1 · Answer 2 · answered Oct 17 '15 at 18:08

1

Don't know how much faster this will be, but here's a dplyr/tidyr solution.

library(dplyr)
library(tidyr)

combination = 
  data_frame(Order.No = combn) %>%
  mutate(combination_ID = 1:n()) %>%
  unnest(Order.No)

acceptable = 
  combination %>%
  left_join(attr) %>%
  group_by(combination_ID) %>%
  summarize(total_weight = sum(Weight),
         total_volume = sum(Volume)) %>%
  filter(total_weight %>% between(20, 50) &
           total_volume %>% between(10, 50) )

answered Oct 17 '15 at 18:08

bramtayl

4,004
2
11
18

Nice solution, +1 ! I came up with something similar (but in a single chaining operation) and added `Order.No = toString(Order.No)` in the `summarize()` call to have the elements of the initial list in the final output. – Steven Beaupré Oct 17 '15 at 18:43
I kept the combination frame separate so that the order numbers could be accessed relationally, e.g, `acceptable %>% left_join(combination)` – bramtayl Oct 17 '15 at 18:51

Efficient Match/Lookup in R

2 Answers2