
Alright. I've heavily edited this question to a) make it make more sense, and b) reflect where I'm at with the problem.

I have two data sets -- let's call them set1 and set2 -- each of about 6 million rows. Currently, I have them loaded into R as data.tables.

>set1<-data.table(read.csv('~/file1.csv', stringsAsFactors=F))
>setkey(set1, id1)
>head(set1)
     id1 start_unixtime end_unixtime seconds_diff        id2
1:  1674     1354741858   1354741858            0  227167461
2:  1674     1354752386   1354752951          565  227246263
3:  1674     1354764412   1354764412            0  227358796
4:  1674     1354773044   1354773776          732  227421295
5:  1674     1354778651   1354778651            0  227448774
6:  1674     1354810424   1354810424            0  227631113
>set2<-data.table(read.csv('~/file2.csv', stringsAsFactors=F))
>setkey(set2, id1)
>head(set2)
     id1    unix_timestamp event_name
1:  1674    1355202784           join
2:  1674    1354351118           join
3:  1674    1354349648           play
4:  1674    1354780517           join
5:  1674    1355278891           join
6:  1674    1354617262           join

One problematic detail to point out: set2 has no unique key column; each row is only unique as a complete tuple. In set1, id2 is unique. Fun times!

The operation I'm performing goes like this: for every row in set2, take its unix_timestamp, find the row in set1 where start_unixtime <= unix_timestamp <= end_unixtime and id1 matches, then assign the corresponding set1.id2 to that row of set2. Every row in set2 has a match in set1, but not every row in set1 has a match in set2, and one id2 can be assigned to many rows in set2. What I need to wind up with is this (NOTE: the following data is fake, as I haven't been able to produce any actual success yet):

>head(set2)
     id1    unix_timestamp event_name         id2
1:  1674        1355202784       join   227167461
2:  1674        1354351118       join   227157309
3:  1674        1354349648       play   227157309
4:  1674        1354780517       join   227157309
5:  1674        1355278891       join   271089456
6:  1674        1354617262       join   221729485
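
To make the rule concrete, here's a naive per-row version of what I'm after (it expresses the logic, but it's hopelessly slow at 6 million rows, which is why I'm reaching for data.table):

set2[, id2 := sapply(seq_len(.N), function(i) {
    # scan set1 for the interval containing this timestamp
    hit <- set1[id1 == set2$id1[i] &
                start_unixtime <= set2$unix_timestamp[i] &
                set2$unix_timestamp[i] <= end_unixtime]
    if (nrow(hit) > 0L) hit$id2[1L] else NA_integer_
})]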

Here is a mess of data.table code I've cooked up:

set2[, id2 := set1[set2[, id1], list(start_unixtime, end_unixtime, id2)][(start_unixtime <= unix_timestamp & unix_timestamp <= end_unixtime), id2, by=id2]][, list(id2)][, id2:= id2]

To talk through what I understand to be going on:

  1. set2 calls the assignment operator :=.
  2. The right-hand side invokes set1, which starts by joining on the id1 values from set2.
  3. The columns start_unixtime, end_unixtime, and id2 are selected.
  4. From that result, a second selection is done, which gets id2 for rows where unix_timestamp falls between start_unixtime and end_unixtime.
  5. ...and here, I think I'm doing something badly wrong -- because at this step, I always seem to have two columns, each labeled id2 and containing identical results. So, I select one column...
  6. ...and specify it for assignment. (I do not know why this is done twice. I found this SO post, which uses this second :=, and this one, which does not, and I have simply no idea why.)

...which does not work. @mnel has proposed something like:

set2[set1, nomatch=0][unix_timestamp %between% list(start_unixtime, end_unixtime)]

...which works when I try it with his test data, but not with my data. It occurs to me that my data might be of some type (character?) which data.table (or R, period) might not be coercing properly? I might be dense, but I can't seem to figure out how to call as.integer() on a specified column of a data.table.
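
(What I was fishing for, I think, is := in j -- a minimal sketch, assuming that's the idiomatic data.table way:)

sapply(set1, class)  # check what R thinks each column is

# convert in place with :=, without copying the whole table
set1[, start_unixtime := as.integer(start_unixtime)]
set1[, end_unixtime := as.integer(end_unixtime)]
set2[, unix_timestamp := as.integer(unix_timestamp)]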

Edit: yeah, my data was all character, and I forgot that data.table inherits from data.frame. So, a little set1$start_unixtime <- as.integer(set1$start_unixtime) and at least I'm sure everything is all integers. However, when I run that command, I still get this:

>head(set2)
Empty data.table (0 rows) of 8 cols: id1,unix_timestamp,event_name,start_unixtime,end_unixtime,seconds_diff...

Addition: here are snippets of my actual data:

set1 <- as.data.table(list(
    id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
    start_unixtime = c(1354741858L, 1354752386L, 1354764412L, 1354773044L, 1354778651L, 1354810424L),
    end_unixtime = c(1354741858L, 1354752951L, 1354764412L, 1354773776L, 1354778651L, 1354810424L),
    seconds_diff = c(0L, 565L, 0L, 732L, 0L, 0L),
    id2 = c(227167461L, 227246263L, 227358796L, 227421295L, 227448774L, 227631113L)))
set2 <- as.data.table(list(
    id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
    unix_timestamp = c(1354752431L, 1354780517L, 1354811978L, 1354824385L, 1354833271L, 1354862753L),
    event_name = c("joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2")))
Gastove

1 Answer


I'm not sure this will work on your data, as you perhaps need to post a more complete example, but something like the following might work. It does one join (binary search) and then one vector scan (which creates a couple of long logical vectors behind the scenes, so it is not ideal).

I've provided a simple but larger example data set with a bit more replication.

DT <- as.data.table(list(id1 = c(5L, 1L, 5L, 1L, 5L, 3L, 5L, 3L, 1L, 3L), 
    id2 = 1:10, startunix = 1:10, endunix = 5:14))

DA <- as.data.table(list(id1 = c(3L, 5L, 5L, 5L), unixtime = c(5L, 1L, 6L, 12L)))

setkey(DA,id1)
setkey(DT,id1)


DT[DA, nomatch=0][unixtime %between% list(startunix, endunix)]

   id1 id2 startunix endunix unixtime
1:   5   1         1       5        1
2:   5   3         3       7        6
3:   5   5         5       9        6

To explain what it is doing: it matches by id1, and nomatch=0 means that rows of DA with no match in DT are dropped. The join expands to all combinations of the multiple rows in DA[J(5)] and DT[J(5)] -- in this case there are 3 x 4 = 12 combinations for id1 == 5:

> DA[J(5)]
   id1 unixtime
1:   5        1
2:   5        6
3:   5       12
> DT[J(5)]
   id1 id2 startunix endunix
1:   5   1         1       5
2:   5   3         3       7
3:   5   5         5       9
4:   5   7         7      11

So, for id1 == 5, the merged data set contains all 12 combinations (4 times 3); the single id1 == 3 row in DA likewise expands against the three id1 == 3 rows in DT.

I then use the function between() (part of the data.table package, here via its infix form %between%) to subset those rows where unixtime is between startunix and endunix; passing the two columns as a list makes the comparison row by row.
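
To illustrate the difference (a minimal sketch, assuming a data.table version with vectorised between()): a two-column list gives rowwise bounds, while a length-2 vector gives scalar bounds:

x  <- c(1L, 6L, 12L)
lo <- c(0L, 7L, 10L)
hi <- c(2L, 9L, 14L)

x %between% list(lo, hi)  # rowwise bounds:       TRUE FALSE  TRUE
x %between% c(0L, 10L)    # scalar bounds 0..10:  TRUE  TRUE FALSE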

To the best of my knowledge you won't be able to use binary searches to find whether something is within a range (however, @MatthewDowle, the main data.table author, is active on SO and may jump in here with a comment as to whether this is possible or likely in the future).

mnel
  • Ah, thanks so much for drawing my attention to `%between%` -- I hadn't found that yet. And thanks for the answer! Interestingly enough, I cannot get your code to work with my datasets, which seems to suggest my integers... well, aren't. I'm going to mess with that, and will report back (and maybe clean up my question a little...) – Gastove Dec 18 '12 at 07:31
    Yup, would be nice to do between joins in future ([FR203](https://r-forge.r-project.org/tracker/?group_id=240&atid=978&func=detail&aid=203)). Can be done now (manually) using a `roll` join to the start with `which=TRUE`, then another `roll` join to the end again with `which=TRUE`, followed by a `seq` or `vecseq` with `+/-1` as appropriate. That's likely much faster than the `%between%` vector scan. – Matt Dowle Dec 18 '12 at 09:34
  • ....uh, huh. Okay. I've found the documentation on `roll` and `which`; I think I see what's going on there. Are you proposing those as modifications to the monstrous command I cooked up, or the somewhat more elegant one @mnel created? Also... I confess that I don't follow what `seq` or `vecseq` get you. – Gastove Dec 18 '12 at 15:36
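
For anyone following along later: below is a rough sketch of the roll-join idea from Matt Dowle's comment above, simplified to the case where intervals never overlap within an id1 (true of the sample data in the question), so a single roll join suffices and the seq/vecseq step for multiple matches isn't needed. Untested against the real 6-million-row data:

library(data.table)

setkey(set1, id1, start_unixtime)
setkey(set2, id1, unix_timestamp)

# Roll each unix_timestamp back to the set1 row with the largest
# start_unixtime <= unix_timestamp for the same id1 -- a binary
# search, with no long logical vectors.
ans <- set1[set2, roll = TRUE]

# In the result, the column named start_unixtime holds set2's
# unix_timestamp values, so this catches timestamps that overshoot
# the matched interval's end.
ans[start_unixtime > end_unixtime, id2 := NA_integer_]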