
I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:

Data

Given two data.tables (dt and dt_lookup)

library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1, n), 
                 thisTime=sample(t, n, replace=TRUE), 
                 thisLocation=sample(la,n,replace=TRUE),
                 finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)

set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)

I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum positive value of thisTime - lkpTime).

Function

## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum positive difference between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){

  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]

  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]

  ## calculate the time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime

  ## take the lkpId with the minimum non-negative difference
  selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
  selectedId
}
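
For a single row, the function can be spot-checked like this (it may emit warnings where no qualifying lkpId exists):

## example call using the first row of dt
getId(dt$thisTime[1], dt$thisLocation[1], dt$finalLocation[1])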

Attempts at a solution

I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when the number of rows (n) exceeded 1,000,000. So I've tried to implement a data.table solution to see if it's faster:

selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]

However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:

lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  getId(thisTime, thisLocation, finalLocation)
})

both taking ~30 seconds for n = 10,000.
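
For anyone reproducing the comparison, the timings came from calls along these lines (a rough sketch; wall-clock times will of course vary by machine):

## rough timing of both approaches (n = 10,000)
system.time(
  dt[, .(lkpId = getId(thisTime, thisLocation, finalLocation)), by = id]
)
system.time(
  apply(dt, 1, function(x) getId(as.numeric(x[["thisTime"]]),
                                 x[["thisLocation"]],
                                 x[["finalLocation"]]))
)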

Question

Is there a better way of using data.table to apply the getId function over each row of dt?

Update 12/08/2015

Thanks to the pointer from @eddi I've redesigned my whole algorithm around rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
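
For reference, the core rolling-join idea looks something like this minimal sketch (it ignores the finalLocation requirement, and unlike getId it also accepts an exact time match, i.e. a difference of zero):

## minimal rolling-join sketch: with dt_lookup keyed on location then time,
## roll = TRUE matches each dt row to the lookup row at the same location
## with the largest lkpTime not exceeding thisTime
setkey(dt_lookup, lkpLocation, lkpTime)
dt_lookup[dt[, .(lkpLocation = thisLocation, lkpTime = thisTime, id)], roll = TRUE]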

  • I would advise minimizing the example data; if you manage to show the problem on 10-20 rows you will get many more users able to investigate it. Additionally, your current solution raises multiple warnings on my machine. With small example data you can also post the expected output. – jangorecki Aug 11 '15 at 10:48
  • @jangorecki My question isn't about a problem with the code or function *per se*; it's asking if there's a better way to use `data.table` on a large data set. For this example the warnings can be ignored (they occur where the function couldn't find an answer, which is OK). – tospig Aug 11 '15 at 11:11
  • `data.table` is not going to magically speed up the exact same loop. You should instead rethink your algorithm - finding the nearest time is easily accomplished via rolling joins, but I'm not sure what to make of your initial "filter" operation. – eddi Aug 11 '15 at 17:31
  • @eddi I suppose the initial 'filter' is a hang-up from initial attempts to reduce the amount of data being looked up (and merged on) in each iteration of the `apply`. Thanks for the prompt about joins; I'll have a re-think about the algorithm as a whole. – tospig Aug 11 '15 at 21:12

1 Answer


Having spent the time since asking this question looking into what data.table has to offer, and researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.

One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.

There will no doubt be better ways of programming this, but here's my attempt.

## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'

## find all lookup ids where 'thisLocation' matches 'lkpLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
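
## join on the key (thisLocation == lkpLocation); by = .EACHI evaluates j once
## per row of dt_lookup, and idx keeps only the matched dt rows with a
## positive time difference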

dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
},
by=.EACHI]

## remove NA rows (lookup locations with no match in dt)
dt_this <- dt_this[complete.cases(dt_this)]

## find all matching 'finalLocation' and 'lkpLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]

## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)

dt_join <- dt_this[dt_final, nomatch=0]

## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]

dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]  
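## (order(timeDiff) ranks the candidates; for each id, .I[1] returns the row
## number of the smallest timeDiff, which is then used to subset dt_join)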

## equivalent dplyr code
# library(dplyr)
# dt_join <- dt_join %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup
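
As a quick sanity check (not part of the solution), the join result can be compared against the original getId() for a handful of ids; where several lkpIds tie on timeDiff the two approaches may legitimately pick different ids:

## compare the join result with the original function for the first few ids
for (i in dt_join[1:5, id]) {
  row <- dt[id == i]
  cat(i, "join:", dt_join[id == i, lkpId],
      "getId:", getId(row$thisTime, row$thisLocation, row$finalLocation), "\n")
}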
tospig