Alright. I've heavily edited this question to a) make it make more sense, and b) reflect where I'm at with the problem.
I have two data sets -- let's call them set1 and set2 -- each of about 6 million rows. Currently, I have them loaded into R as data.tables.
>set1<-data.table(read.csv('~/file1.csv', stringsAsFactors=F))
>setkey(set1, id1)
>head(set1)
id1 start_unixtime end_unixtime seconds_diff id2
1: 1674 1354741858 1354741858 0 227167461
2: 1674 1354752386 1354752951 565 227246263
3: 1674 1354764412 1354764412 0 227358796
4: 1674 1354773044 1354773776 732 227421295
5: 1674 1354778651 1354778651 0 227448774
6: 1674 1354810424 1354810424 0 227631113
>set2<-data.table(read.csv('~/file2.csv', stringsAsFactors=F))
>setkey(set2, id1)
>head(set2)
id1 unix_timestamp event_name
1: 1674 1355202784 join
2: 1674 1354351118 join
3: 1674 1354349648 play
4: 1674 1354780517 join
5: 1674 1355278891 join
6: 1674 1354617262 join
One problematic detail to point out: set2 has no unique keys. Only the tuple of each row is actually unique. In set1, id2 is unique. Fun times!
The operation I'm performing goes like this: for every row in set2
, I need to take the unix_timestamp, find the row in set1
where start_unixtimestamp <= unix_timestamp <= end_unixtimestamp
and id1 matches, then assign the corresponding set1.id2
to the appropriate row in set2
. Every row in set2
has an entry in set1
, but not every row in set1
has an entry in set2
. One id2
can be assigned to many rows in set2
. What I need to wind up with is this (NOTE: the following data is fake, as I haven't been able to produce any actual success yet.):
>head(set2)
id1 unix_timestamp event_name id2
1: 1674 1355202784 join 227167461
2: 1674 1354351118 join 227157309
3: 1674 1354349648 play 227157309
4: 1674 1354780517 join 227157309
5: 1674 1355278891 join 271089456
6: 1674 1354617262 join 221729485
Here is a mess of data table I've cooked up:
set2[, id2 := set1[set2[, id1], list(start_unixtime, end_unixtime, id2)][(start_unixtime <= unix_timestamp & unix_timestamp <= end_unixtime), id2, by=id2]][, list(id2)][, id2:= id2]
To talk through what I understand to be going on:
set2
calls the assignment operator:=
- The right hand side invokes
set1
, which starts byjoining
id1 rows from set2. - The columns
start_unixtime
,end_unixtime
, andid2
are selected. - From that result, a second set of selections are done, which gets
id2
where theutc_timestamp
ofid2
is betweenstart_unixtime
andend_unixtime
. - ...and here, I think I'm doing something badly wrong -- because at this step, I always seem to have two columns, each labeled
id2
and containing identical results. So, I select one column... - ...and specify it for assignment. (I do not know why this is done twice. I found this SO post, which uses this second
:=
, and this one which does not, and I have simply no idea why.
...which does not work. @mnel has proposed something like:
set2[set1, nomatch=0][unix_timestamp %between c(start_unixtime, end_unixtime, incbounds=T)]
...which works when I try it with his test data, but not with my data. It occurs to me that my data might be of some type (character?) which data.table
(or R period) might not be coercing properly? I might be dense, but I can't seem to figure out how to call as.integer()
on a specified column of a data.table
.
Edit: yeah, my data was all character, and I forgot that data.table
inherits from data.frame
. So, a little set1$start_unixtime <- as.integer($set1$start_unixtime)
and at least I'm sure everything is all integers. However, when I run that command, I still get this:
>head(set2)
Empty data.table (0 rows) of 8 cols: id1,utc_timestamp,event_name,start_unixtime,end_unixtime,seconds_diff...
Addition Here are snippets of my actual data:
set1 <- as.data.table(list(id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
start_unixtime = c(1354741858L, 1354752386L, 1354764412L, 1354773044L, 1354778651L, 1354810424L),
end_unixtime = c(1354741858L, 1354752951L, 1354764412L, 1354773776L, 1354778651L, 1354810424L),
seconds_diff = c(0L, 565L, 0L, 732L, 0L, 0L),
id2 = c(227167461L, 227246263L, 227358796L, 227421295L, 227448774L, 227631113L))
set2 <- as.data.table(list(
id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
utc_timestamp = c(1354752431L, 1354780517L, 1354811978L, 1354824385L, 1354833271L, 1354862753L),
event_name = c("joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2","joinRegularTable_2", "joinRegularTable_2"))