
I have a large dataset and a lookup table. For each row in the dataset, I need to return the smallest Val among the lookup rows whose conditions that row satisfies.

Given the size of my dataset, I'm reluctant to hack together an iffy solution by cross-joining, as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base R or data.table efficiently, since these are already in use.

Example

library(data.table)

A <- seq(1e4, 9e4, 1e4)
B <- seq(0, 1e4, 1e3)

grid <- expand.grid(A, B)
dt1 <- data.table(grid, ID = seq_len(nrow(grid)))
setnames(dt1, c("Var1", "Var2"), c("A", "B"))

lookup <- data.table(minA = c(1e4, 1e4, 2e4, 2e4, 5e4),
                     maxA = c(2e4, 3e4, 7e4, 6e4, 9e4),
                     minB = rep(2e3, 5),
                     Val  = seq(.1, .5, .1))

# Sample  Desired Value
     A     B    ID Val
99: 90000 10000 99 0.5

In SQL, I would then write something along the lines of

SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
                 and dt1.A<=lookup.maxA
                 and dt1.B>=lookup.minB
GROUP BY ID, A, B

Which would join all matching records from lookup to dt1 and return the smallest Val.
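For reference, recent versions of data.table (1.9.8 and later, so newer than this question) support non-equi joins directly, which express that SQL almost verbatim without materialising a full cross join. A sketch using the example data above (`res` is my name for the result):

```r
library(data.table)

# Rebuild the example data from the question
A <- seq(1e4, 9e4, 1e4)
B <- seq(0, 1e4, 1e3)
grid <- expand.grid(A, B)
dt1 <- data.table(grid, ID = seq_len(nrow(grid)))
setnames(dt1, c("Var1", "Var2"), c("A", "B"))

lookup <- data.table(minA = c(1e4, 1e4, 2e4, 2e4, 5e4),
                     maxA = c(2e4, 3e4, 7e4, 6e4, 9e4),
                     minB = rep(2e3, 5),
                     Val  = seq(.1, .5, .1))

# Non-equi join: each dt1 row is matched against every lookup row whose
# range conditions it satisfies; unmatched rows get Val = NA (left-join
# semantics), then we keep the smallest Val per ID
res <- lookup[dt1,
              .(ID = i.ID, A = i.A, B = i.B, Val = x.Val),
              on = .(minA <= A, maxA >= A, minB <= B),
              allow.cartesian = TRUE][, .(Val = min(Val)), by = .(ID, A, B)]

res[ID == 99]  # A = 90000, B = 10000, Val = 0.5
```

Only the rows that actually satisfy the range conditions are ever joined, so this avoids the many-millions-of-records blow-up of an unconditional cross join.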

Update

My solution so far looks like:

library(plyr)  # provides rbind.fill

CJ.table <- function(X, Y)
  setkey(X[, c(k = 1, .SD)], k)[Y[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]

dt1.lookup <- CJ.table(dt1, lookup)[A >= minA & A <= maxA & B >= minB,
                                    list(Val = min(Val)),
                                    by = list(ID, A, B)]
dt1.lookup <- rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])

This retrieves all records, allows additional columns from the lookup table to be returned if I need them, and has the benefit of enforcing the pick of the minimum Val.

Steph Locke
  • note: using `CJ` (implemented in the `data.table` package) would be faster than `expand.grid`. – Arun Feb 05 '14 at 20:51
  • see my edit for a minor tweak to your existing solution. – Dean MacGregor Feb 06 '14 at 16:36
  • Depending on the ranges of A and B you might consider manipulating `dt1` before cross joining like in this question http://stackoverflow.com/questions/16423817/create-dummy-variables-in-one-table-based-on-range-of-dates-in-another-table – Dean MacGregor Feb 06 '14 at 16:42
  • My last edit might be a stretch but you could certainly prefilter `dt1` like Senor O did ie. `dt1.prep<-dt1[A>=lookup[,min(minA)] & A<=lookup[,max(maxA)] & B>=lookup[,min(minB)]]` and then only crossjoin `dt1.prep` instead of the whole `dt1` – Dean MacGregor Feb 06 '14 at 16:47
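The prefilter suggested in the comments above can be sketched end-to-end as follows (`dt1.prep` is the name used in the comment; `dt1` and `lookup` are the question's tables, rebuilt here so the snippet runs standalone). Rows that cannot match any lookup range are dropped before the cross join, shrinking the cartesian product:

```r
library(data.table)

# Example data from the question
A <- seq(1e4, 9e4, 1e4); B <- seq(0, 1e4, 1e3)
grid <- expand.grid(A, B)
dt1 <- data.table(grid, ID = seq_len(nrow(grid)))
setnames(dt1, c("Var1", "Var2"), c("A", "B"))
lookup <- data.table(minA = c(1e4, 1e4, 2e4, 2e4, 5e4),
                     maxA = c(2e4, 3e4, 7e4, 6e4, 9e4),
                     minB = rep(2e3, 5), Val = seq(.1, .5, .1))

# Keep only rows that could possibly match at least one lookup range;
# the rows dropped here are exactly the ones that would end up with NA,
# so they can be re-attached after the join, as in the question's update
dt1.prep <- dt1[A >= lookup[, min(minA)] &
                A <= lookup[, max(maxA)] &
                B >= lookup[, min(minB)]]
nrow(dt1.prep)  # 81 of the original 99 rows survive
```

On this toy example the saving is small, but with 800k rows and a wide lookup the cross join grows with the product of the two row counts, so every row filtered out up front pays off.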

2 Answers


A solution I found that avoids a cross join first prepares the data by removing rows where A and B are entirely out of range:

Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]

Then you make a data table of where each of the conditions is met, corresponding to the lowest possible Val:

Indices = Prep[,list(min(which(A >= lookup$minA)), 
                     min(which(A <= lookup$maxA)), 
                     min(which(B >= lookup$minB)), A, B),by=ID]

Then you must get Val at the lowest point where all three conditions are satisfied:

Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]

See if this gets you what you're looking for:

   ID Val     A     B
 1: 19 0.1 10000  2000
 2: 20 0.1 20000  2000
 3: 21 0.2 30000  2000
 4: 22 0.3 40000  2000
 5: 23 0.3 50000  2000
 6: 24 0.3 60000  2000
 7: 25 0.3 70000  2000
 8: 26 0.5 80000  2000
 9: 27 0.5 90000  2000
10: 28 0.1 10000  3000
Señor O
  • Great idea! Will wrap my head around it and get back to you – Steph Locke Feb 05 '14 at 16:09
  • Thanks for this @Senor, unfortunately this doesn't appear to guarantee that the minimum Val is selected when more than one lookup record matches. A cross join is bad for performance in this instance, but I took your solution and changed it somewhat based on another SO answer on cross joins http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r/14165493#14165493 – Steph Locke Feb 06 '14 at 11:45

My first thought was trying to make an index like Senor O did. However, the min(Val) made the index table tougher for me to think through. The way I thought to do it was to loop through the lookup table.

dt1[, Val := as.numeric(NA)]
lk <- lookup[order(Val)]  # sort once, outside the loop
for (row in 1:nrow(lk)) {
  dt1[A >= lk[row, minA] & A <= lk[row, maxA] & B >= lk[row, minB] & is.na(Val),
      Val := lk[row, Val]]
}

I think this should work because it first initialises the new column with NA values.

Then it sorts the lookup table by Val, so the smallest values are assigned first.

Each pass of the loop only changes dt1 rows whose Val is still NA, and since we loop through lookup from smallest Val to biggest, this ensures every row gets the min(Val) you wanted.

Replace the `rbind.fill` line in the question's update with

rbindlist(list(dt1.lookup,dt1[!ID %in% dt1.lookup[,ID]][,list(ID, A, B, Val=as.numeric(NA))]))

and it will eliminate the dependency on the plyr package (where `rbind.fill` actually lives), and I think it'll be faster.
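A small variant of the above (my suggestion, not part of the answer): current versions of `rbindlist` also take a `fill` argument, so the missing Val column can be padded with NA automatically instead of being spelled out. Rebuilt end-to-end with the question's data and `CJ.table` so it runs standalone:

```r
library(data.table)

# Example data and the question's CJ.table solution, rebuilt for context
A <- seq(1e4, 9e4, 1e4); B <- seq(0, 1e4, 1e3)
grid <- expand.grid(A, B)
dt1 <- data.table(grid, ID = seq_len(nrow(grid)))
setnames(dt1, c("Var1", "Var2"), c("A", "B"))
lookup <- data.table(minA = c(1e4, 1e4, 2e4, 2e4, 5e4),
                     maxA = c(2e4, 3e4, 7e4, 6e4, 9e4),
                     minB = rep(2e3, 5), Val = seq(.1, .5, .1))

CJ.table <- function(X, Y)
  setkey(X[, c(k = 1, .SD)], k)[Y[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]

dt1.lookup <- CJ.table(dt1, lookup)[A >= minA & A <= maxA & B >= minB,
                                    list(Val = min(Val)), by = list(ID, A, B)]

# fill = TRUE matches columns by name and pads the absent Val column
# with NA, so no explicit list(ID, A, B, Val = as.numeric(NA)) is needed
res <- rbindlist(list(dt1.lookup, dt1[!ID %in% dt1.lookup$ID]), fill = TRUE)
```

This keeps everything inside data.table and gives the same left-join shape: 99 rows, with NA Val for the IDs that matched no lookup record.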

Dean MacGregor
  • I'm not sure the loop concept is feasible for my 800k records - will certainly take a look though – Steph Locke Feb 05 '14 at 16:55
  • I did a couple of quick `system.time` tests and Senor's is about 3x faster when I increased `dt1` to 900900 rows, but still under 1 sec for both. I was halfway done writing mine when Senor's answer showed up, at least on my screen, so I felt committed to finishing. – Dean MacGregor Feb 05 '14 at 17:08
  • Thanks Dean, unfortunately this turned out to be too time-consuming with my lookup table being more than 300 records, and in my real task, where I need to base the choice on a variable derived from Val, it was proving beyond my skill to get it working as required. I posted my currently working solution, so if you have any further ideas, I'd be grateful to hear them – Steph Locke Feb 06 '14 at 11:47